Antimatroid, The

thoughts on computer science, electronics, mathematics

Deep Learning for Automatic Speech Recognition



The problem of automatic speech recognition and the traditional Hidden Markov Model and Gaussian Mixture Model hybrid architecture (HMM-GMM) for acoustic modeling are detailed in [JM08] and will be skipped here. Instead, the focus of this literature review is to discuss how [DYDA12] use a context-dependent Deep Neural Network and Hidden Markov Model hybrid architecture (CD-DNN-HMM) for acoustic modeling, as it represents a significant improvement over the traditional HMM-GMM approach. This review will begin with the motivation for the architecture, then go into detail on the algorithms used for pre-training, and outline the algorithms used for training before concluding with how well the approach performs relative to the standard HMM-GMM approach.


To motivate their architecture, [DYDA12] rely on the standard noisy channel model for speech recognition presented in [JM08], where we wish to find the word sequence with the highest probability given our input audio observations:

\displaystyle \hat{w} = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( w \lvert x \right ) = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( x \lvert w \right ) \mathbb{P} \left(  w  \right ) (1)

Where \mathbb{P} \left( w \right ) and \mathbb{P} \left( x \lvert w \right ) represent the language and acoustic models respectively. [JM08] state that the language model can be computed via an N-gram model; [DYDA12] acknowledge using this approach, but focus their efforts into explaining their acoustic model:

\displaystyle \mathbb{P} \left( x \lvert w \right ) = \sum_{q}  \mathbb{P} \left( x \lvert q, w \right ) \mathbb{P} \left( q \lvert w \right ) \approxeq \max_{q} \pi(q_0) \prod_{t = 1}^T a_{q_{t-1} q_t} \prod_{t=0}^T \mathbb{P} \left( x_t \lvert q_t \right ) (2)

Here the acoustic model is viewed as a sequence of transitions between states of tied-state triphones which [DYDA12] refer to as senones giving us the context dependent aspect of the architecture. [FLMS14] explains that senones represent the pronunciation of words and are derived by decision trees. By tying triphone states together, this approach is able to avoid having to process a large number of triphones and avoid the likely sparseness of training examples for every possible triphone.

The model assumes that there is a probability \pi(q_0) for the starting state, probabilities a_{q_{t-1} q_{t}} of transitioning from the state at step t - 1 to the state at step t, and finally, a probability of the acoustics given the current state q_t. [DYDA12] expand this last term further into:

\displaystyle \mathbb{P} \left( x_t \lvert q_t \right ) = \frac{\mathbb{P} \left( q_t \lvert x_t \right ) \mathbb{P} \left( x_t \right ) }{\mathbb{P} \left( q_t \right ) } (3)

Where \mathbb{P} \left( q_t \lvert x_t \right ) models the tied triphone senone posterior given mel-frequency cepstral coefficients (MFCCs) based on 11 sampled frames of audio. While MFCCs come from signal processing, they have proven to be effective features for automatic speech recognition. Based on the power spectrum derived from sampled audio frames, MFCCs represent characteristics of the audio that our ears are sensitive to, as explained in [Ada10]. \mathbb{P} \left( q_t \right ) is the prior probability of the senone, and \mathbb{P} \left( x_t \right ) can be ignored since it does not vary based on the decoded word sequence we are trying to find.
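As a minimal sketch of Eq. (3) in the log domain: dropping the \mathbb{P} \left( x_t \right ) term leaves the senone posterior divided by the senone prior, which is the score passed on to the HMM. The function name and the three-senone numbers below are made up for illustration and do not come from [DYDA12].

```python
import math

def scaled_log_likelihoods(posteriors, priors):
    """Return log P(q|x) - log P(q) for each senone.

    This equals log P(x|q) up to the additive constant log P(x),
    which Eq. (3) lets us ignore during decoding.
    """
    return [math.log(post) - math.log(prior)
            for post, prior in zip(posteriors, priors)]

# Hypothetical values for three senones at a single frame.
posteriors = [0.7, 0.2, 0.1]   # P(q_t | x_t), e.g. a DNN softmax output
priors = [0.5, 0.25, 0.25]     # P(q_t), e.g. relative senone frequencies
scores = scaled_log_likelihoods(posteriors, priors)
```

Dividing by the prior keeps frequently occurring senones from dominating the decode simply because they are common in the training data.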

Based on this formalism, [DYDA12] chose to use a pre-trained Deep Neural Network to estimate \mathbb{P} \left( q_t \lvert x_t \right ) using MFCCs as DNN inputs and taking the senone posterior probabilities as DNN outputs. The transitioning between events is best modeled by a Hidden Markov Model whose notation, \pi, a, \text{and } q appears in Eq. (2). Now that we have an overview of the general CD-DNN-HMM architecture, we can look at how [DYDA12] train their model.
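The maximization on the right-hand side of Eq. (2) can be sketched with a small Viterbi search in the log domain. This is an illustrative toy, not the decoder used by [DYDA12]: the two-state model, its probabilities, and the function names are all assumptions.

```python
import math

def viterbi(log_pi, log_a, log_obs):
    """Best state path for the max term of Eq. (2).

    log_pi[s]     : log start probability of state s
    log_a[r][s]   : log probability of the transition r -> s
    log_obs[t][s] : log score of frame t under state s
    """
    n = len(log_pi)
    score = [log_pi[s] + log_obs[0][s] for s in range(n)]
    back = []
    for t in range(1, len(log_obs)):
        prev = score
        # Best predecessor of each state s at frame t.
        ptr = [max(range(n), key=lambda r: prev[r] + log_a[r][s])
               for s in range(n)]
        score = [prev[ptr[s]] + log_a[ptr[s]][s] + log_obs[t][s]
                 for s in range(n)]
        back.append(ptr)
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]

def logs(values):
    return [math.log(v) for v in values]

# Hypothetical two-state model observed over three frames.
log_pi = logs([0.9, 0.1])
log_a = [logs([0.7, 0.3]), logs([0.1, 0.9])]
log_obs = [logs([0.8, 0.2]), logs([0.6, 0.4]), logs([0.1, 0.9])]
path, log_prob = viterbi(log_pi, log_a, log_obs)
print(path)  # [0, 0, 1]
```

In the hybrid system the entries of log_obs would be the scaled likelihoods of Eq. (3) rather than the hand-picked numbers used here.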


Given the DNN model, we wish to fit the parameters of the model to a training set. This is usually accomplished by maximizing a likelihood function (equivalently, minimizing the negative log likelihood) and deploying a gradient descent procedure to update the weights. One complication to this approach is that the likelihood can be computationally expensive for multilayer networks with many nodes, rendering the approach unusable. As an alternative, one can attempt to optimize a computationally tractable surrogate to the likelihood. In this case the surrogate is the contrastive divergence method developed by [Hin02]. This sidestep enabled [HOT06] to develop an efficient unsupervised greedy pre-training process whose results can then be refined using a few iterations of the traditional supervised backpropagation approach. In this portion of the paper we discuss the work of [Hin02] and explain the greedy algorithm of [HOT06] before going on to discuss the high-level training procedure of [DYDA12].

To understand the pre-training process, it is necessary to discuss the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) models. An RBM is an undirected bipartite graphical model in which input nodes in a visible layer connect to binary nodes in a hidden layer. [DYDA12] use Gaussian distributed visible nodes for the real-valued acoustic input; the expressions below are for the simpler case in which the visible nodes are also binary. Every possible arrangement of hidden, h, and visible, v, nodes is given an energy under the RBM model:

\displaystyle E(v, h) = - b^T v - c^T h - v^T W h (4)

Where W is the weight of connections between nodes and vectors b and c correspond to the visible and hidden biases respectively. The resulting probability is then given by:

\displaystyle \mathbb{P} \left( v, h \right ) = \frac{e^{-E(v, h)}}{Z} (5)

Where Z is a normalization factor. Based on the assumptions of the RBM, [DYDA12] derive expressions for \mathbb{P} \left( h = 1 \lvert v \right ) and \mathbb{P} \left( v = 1 \lvert h \right ) given by:

\displaystyle \mathbb{P} \left( h = 1 \lvert v \right ) = \sigma(c + v^T W) \qquad \mathbb{P} \left( v = 1 \lvert h \right ) = \sigma(b + h^T W^T) (6)

Where \sigma is an element-wise logistic function. [DYDA12] argue that Eq. (6) allows one to repurpose the RBM parameters to initialize a neural network. Training of the RBM is done by stochastic gradient descent against the negative log likelihood since we wish to find a stable energy configuration for the model:

\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model} (7)

However, [DYDA12] point out that the gradient of the negative log likelihood cannot be computed exactly since the \langle \cdot \rangle_\text{model} term takes exponential time. As a result, the contrastive divergence method is used to approximate the derivative:

\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} \approx \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{1} (8)

where \langle \cdot \rangle_\text{1} is a single step Gibbs sampled expectation. Each \langle v_i h_j \rangle term is the expected frequency with which nodes i \text{ and } j are simultaneously active, under the training data and under the model respectively. With this approximation, regular stochastic gradient descent can be performed and the parameters of an RBM fitted to training data.
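To make the cost of the exact model term concrete, the following sketch enumerates every state of a tiny RBM. It assumes binary visible and hidden units, and the parameter values and function names are invented for illustration. It evaluates Eqs. (4) and (5) by brute force, checks the closed form of Eq. (6) against enumeration, and computes \langle v_i h_j \rangle_\text{model} from Eq. (7) exactly.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """Eq. (4): E(v, h) = -b^T v - c^T h - v^T W h."""
    vis = sum(bi * vi for bi, vi in zip(b, v))
    hid = sum(cj * hj for cj, hj in zip(c, h))
    pair = sum(v[i] * W[i][j] * h[j]
               for i in range(len(v)) for j in range(len(h)))
    return -vis - hid - pair

def all_states(n):
    return list(itertools.product([0, 1], repeat=n))

def partition(W, b, c):
    """Z in Eq. (5): a sum over all 2^(n_v + n_h) joint states."""
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in all_states(len(b)) for h in all_states(len(c)))

def model_expectation(i, j, W, b, c):
    """<v_i h_j>_model in Eq. (7), by full enumeration."""
    total = sum(v[i] * h[j] * math.exp(-energy(v, h, W, b, c))
                for v in all_states(len(b)) for h in all_states(len(c)))
    return total / partition(W, b, c)

def hidden_on_exact(j, v, W, b, c):
    """P(h_j = 1 | v) by enumerating hidden states (Z cancels)."""
    states = all_states(len(c))
    weights = [math.exp(-energy(v, h, W, b, c)) for h in states]
    on = sum(w for h, w in zip(states, weights) if h[j] == 1)
    return on / sum(weights)

def hidden_on_sigmoid(j, v, W, c):
    """P(h_j = 1 | v) from the closed form in Eq. (6)."""
    act = c[j] + sum(v[i] * W[i][j] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-act))

# Hypothetical parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.4]
v = (1, 0, 1)
exact = hidden_on_exact(0, v, W, b, c)
closed = hidden_on_sigmoid(0, v, W, c)
expectation = model_expectation(0, 0, W, b, c)
```

With 5 units the enumeration covers only 32 joint states, but the count doubles with every added unit, which is why the single Gibbs step of Eq. (8) is used in practice.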

Now that we have an understanding of RBMs, we can shift our focus to DBNs. A Deep Belief Network is a multilayer model with undirected connections between the top two layers and directed connections between the other layers. To train these models, [HOT06] had the insight to treat adjacent layers of nodes as RBMs. One starts with the bottom two layers and trains them as though they were a single RBM. Once those two layers are trained, the top layer of that RBM is treated as the input layer of a new RBM, with the layer above it acting as the hidden layer of the new RBM. This sliding window over the layers continues until the full DBN is trained. After this, [HOT06] describe an “up-down” algorithm to further refine the learned weights. The learned parameters of this greedy approach can then be used as the parameters of a DNN as explained earlier in the discussion of Eq. (6).
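The greedy layer-by-layer procedure can be sketched in a few dozen lines. This is a toy under stated assumptions rather than the implementation of [HOT06] or [DYDA12]: all units are binary, reconstructions and upper-layer inputs are kept as probabilities instead of being sampled, the “up-down” refinement is omitted, and the data, layer sizes, learning rate, and function names are made up.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) from Eq. (6)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) from Eq. (6)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def train_rbm(data, n_hidden, rng, epochs=20, lr=0.1):
    """Fit one RBM with single-step contrastive divergence, Eq. (8)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b = [0.0] * n_visible
    c = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            p_h0 = hidden_probs(v0, W, c)
            h0 = [1 if rng.random() < p else 0 for p in p_h0]
            v1 = visible_probs(h0, W, b)   # reconstruction, not sampled
            p_h1 = hidden_probs(v1, W, c)
            for i in range(n_visible):
                for j in range(n_hidden):
                    # Data statistics minus one-step statistics.
                    W[i][j] += lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
                b[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                c[j] += lr * (p_h0[j] - p_h1[j])
    return W, b, c

def greedy_pretrain(data, layer_sizes, rng):
    """Train one RBM per layer; its hidden activations feed the next RBM."""
    layers = []
    inputs = data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(inputs, n_hidden, rng)
        layers.append((W, c))   # weights and hidden biases seed a DNN layer
        inputs = [hidden_probs(v, W, c) for v in inputs]
    return layers

rng = random.Random(0)
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
layers = greedy_pretrain(data, [3, 2], rng)
```

Only the weights and hidden biases of each RBM are kept, since those are the quantities that Eq. (6) shares with a sigmoid layer of a feed-forward network.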


Training of the CD-DNN-HMM model consists of roughly a dozen involved steps. We won’t elaborate here on the full details of each step, but will instead provide a high-level sketch of the procedure to convey its general mechanics.

The first high-level step of the procedure is to initialize the CD-DNN-HMM model. This is done by first training a decision tree to find the best tying of triphone states which are then used to train a CD-GMM-HMM system. Next, the unique tied state triphones are each assigned a unique senone identifier. This mapping will then be used to label each of the tied state triphones. (These identifiers will be used later to refine the DNN.) Finally, the trained CD-GMM-HMM is converted into a CD-DNN-HMM by retaining the triphone and senone structure and HMM parameters. This resulting DNN goes through the previously discussed pre-training procedure.

The next high-level step iteratively refines the CD-DNN-HMM. To do this, the originally trained CD-GMM-HMM model is first used to generate a raw state-level alignment, in which each state is then mapped to its corresponding senone identifier. The resulting alignment is then used to refine the DBN by backpropagation. Next, the prior senone probability is estimated as the number of frames paired with the senone divided by the total number of frames. These estimates are then used to refine the HMM transition probabilities to maximize the likelihood of the features. Finally, if the newly estimated parameters do not improve accuracy against a development set, then the training procedure terminates; otherwise, the procedure repeats this high-level step.
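Two of the bookkeeping steps above, assigning senone identifiers and estimating senone priors from frame counts, are simple enough to sketch directly. The state names, the alignment, and the function names are hypothetical; in the real procedure the alignment would come from the CD-GMM-HMM system.

```python
from collections import Counter

def assign_senone_ids(tied_states):
    """Give each distinct tied triphone state a unique integer identifier."""
    return {state: idx for idx, state in enumerate(sorted(set(tied_states)))}

def senone_priors(alignment, senone_ids):
    """Estimate P(senone) as frames aligned to it over total frames."""
    counts = Counter(senone_ids[state] for state in alignment)
    total = len(alignment)
    return {sid: counts[sid] / total for sid in senone_ids.values()}

# Hypothetical frame-level alignment; the state names are made up.
alignment = [
    "a-b+c_s2", "a-b+c_s2", "b-c+d_s1", "a-b+c_s2",
    "c-d+e_s3", "b-c+d_s1", "a-b+c_s2", "c-d+e_s3",
]
ids = assign_senone_ids(alignment)
priors = senone_priors(alignment, ids)
print(priors)  # {0: 0.5, 1: 0.25, 2: 0.25}
```

These priors are the \mathbb{P} \left( q_t \right ) term that divides the DNN posterior in Eq. (3).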

Experimental Results

System Configurations

[DYDA12] report that their system relies on a nationwide language model consisting of 1.5 million trigrams. For their acoustic model, they use a five hidden layer DNN with each layer containing 2,048 hidden units. Training the system from scratch on 24 hours of training data takes four days on a Dell T3500 workstation with an NVIDIA Tesla GPU. [DYDA12] emphasize the importance of the GPU in obtaining acceptable training time, noting that without it, training would be 30x slower.

Datasets and Metrics

Comparison of automatic speech recognition systems relies on three principal error metrics: sentence (SER), word (WER), and phoneme (PER) error rates. Each is the ratio of incorrect entities to the total number of entities, with the exception of word error rate, which uses the Levenshtein distance to count the insertions, substitutions, and deletions relative to the total number of words. A sentence is considered incorrect if it contains at least one incorrect word.
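A minimal sketch of the word and sentence error rates as described above, using a standard dynamic-programming Levenshtein distance over whitespace-separated words. The reference and hypothesis sentences are made up, and real scoring tools typically apply text normalization that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1]

def word_error_rate(refs, hyps):
    """Total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

def sentence_error_rate(refs, hyps):
    """Fraction of sentences containing at least one incorrect word."""
    wrong = sum(r.split() != h.split() for r, h in zip(refs, hyps))
    return wrong / len(refs)

refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sit on mat", "hello world"]
print(word_error_rate(refs, hyps))      # 0.25
print(sentence_error_rate(refs, hyps))  # 0.5
```

The first hypothesis has one substitution and one deletion against eight reference words in total, giving a WER of 2/8, while one of the two sentences is wrong, giving an SER of 1/2.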

These error metrics are often tied to particular datasets: WER is reported for Switchboard, SER for Bing Mobile Voice Search (BMVS), and PER for TIMIT. Switchboard is a collection of phone conversations between two people, BMVS is a collection of short spoken queries such as “The Med” or “Chautauqua Park” that are used to find these locations, and TIMIT is a corpus of phonetically rich spoken sentences.


        Switchboard   BMVS       TIMIT
        (WER)         (SER)      (PER)
GMM     23.6 [2]      36.2 [1]   21.7 [2]
DNN     16.1 [2]      30.4 [1]   21.9 [3]
CNN     -             -          20.2 [3]
RNN     -             -          17.7 [4]

Table 1: Comparison of different architectures on different datasets and their corresponding error metrics, as reported in the following sources: [1] [DYDA12], [2] [DBL12], [3] [AMJ+14], [4] [GMH13].

Direct comparison of models is complicated by the variety of error metrics and datasets; [DBL12] is used to fill in these gaps to make a meaningful comparison. As one can see from Table (1), the neural network approaches do better on average than the traditional GMM approach. To illustrate that it is not only DNN approaches that do better, the work of [AMJ+14] using a Convolutional Neural Network (CNN) and [GMH13] using a Recurrent Neural Network (RNN) are included to further drive home the point that neural network architectures are viable alternatives to GMMs.
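To put the GMM and DNN rows of the table on a common footing, one can compute the relative error reduction per dataset. This is plain arithmetic on the figures quoted in the table, and the function name is an assumption. Note that the two TIMIT figures come from different sources, so that comparison in particular should be read loosely.

```python
def relative_reduction(baseline, new):
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline - new) / baseline

# (GMM, DNN) error rates in percent, as quoted in the table.
results = {
    "Switchboard (WER)": (23.6, 16.1),
    "BMVS (SER)": (36.2, 30.4),
    "TIMIT (PER)": (21.7, 21.9),
}
for name, (gmm, dnn) in results.items():
    print(f"{name}: {100 * relative_reduction(gmm, dnn):.1f}%")
# Switchboard (WER): 31.8%
# BMVS (SER): 16.0%
# TIMIT (PER): -0.9%
```

A negative value means the DNN figure is slightly worse than the GMM figure on that dataset.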


[DYDA12], [AMJ+14], and [GMH13] have shown that neural network architectures perform better than Gaussian Mixture Models. [DYDA12] believe that a more capable first layer model provided by mean-covariance restricted Boltzmann machines will increase performance, while [AMJ+14] plan to investigate why convolutional restricted Boltzmann machines gave unexpected improvements in large-vocabulary speech recognition that were absent in phone recognition tasks. Both routes seem promising and are likely to produce improved error rates in line with the results of [GMH13].

In [DBL12], the authors, drawn from four research groups, suggest key gains will come from improved understanding of the pre-training process and how the types of units used in these models affect error rates. They conclude that distributed training is the largest hurdle to overcome for these systems to make use of more training data. (Parallelization is limited by the sequential stochastic gradient descent at the heart of the pre-training and training processes.) As [DYDA12] point out in their paper, GPU-based approaches can assist in reducing computation time, but more foundational approaches need to be pursued.

In a 2014 talk [Hin14], Hinton criticizes existing neural network architectures on philosophical grounds, arguing that they do not correspond well enough to how the brain functions and citing inadequate structural complexity. His proposed solution is a new neural network approach that clusters neurons together into capsules, which he believes will better model how the cortical columns of the brain behave. If Hinton is right (which his track record suggests), then it is likely we’ll see this capsule approach outperform existing models, and consequently, yield improved error rates in automatic speech recognition.


[Ada10] Andre Gustavo Adami. Automatic speech recognition: From the beginning to the Portuguese language. In The Int. Conf. on Computational Processing of Portuguese (PROPOR). Rio Grande do Sul: Porto Alegre, 2010.

[AMJ+14] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533-1545, 2014.

[DBL12] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97, 2012.

[DYDA12] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30-42, 2012.

[FLMS14] Luciana Ferrer, Yun Lei, Mitchell McLaren, and Nicolas Scheffer. Spoken language recognition based on senone posteriors. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 2150-2154. ISCA, 2014.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645-6649, 2013.

[Hin02] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[Hin14] Geoffrey E. Hinton. What’s wrong with convolutional nets? Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Fall Colloquium Series, 2014.

[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.

