# Antimatroid, The

thoughts on computer science, electronics, mathematics

## Deep Learning for Automatic Speech Recognition

### Introduction

The problem of automatic speech recognition and the traditional Hidden Markov Model and Gaussian Mixture Model hybrid architecture (HMM-GMM) for acoustic modeling are detailed in [JM08] and will not be covered here. Instead, this literature review focuses on how [DYDA12] use a context-dependent Hidden Markov Model and Deep Neural Network hybrid architecture (CD-DNN-HMM) for acoustic modeling, as it represents a significant improvement over the traditional HMM-GMM approach. The review begins with the motivation for the architecture, then details the algorithms used for pre-training, and outlines the algorithms used for training before concluding with how well the approach performs relative to the standard HMM-GMM approach.

#### Architecture

To motivate their architecture, [DYDA12] rely on the standard noisy channel model for speech recognition presented in [JM08] where we wish to maximize the likelihood of a decoded word sequence given our input audio observations:

 $\displaystyle \hat{w} = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( w \lvert x \right ) = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( x \lvert w \right ) \mathbb{P} \left( w \right )$ (1)

Where $\mathbb{P} \left( w \right )$ and $\mathbb{P} \left( x \lvert w \right )$ represent the language and acoustic models respectively. [JM08] state that the language model can be computed via an N-gram model; [DYDA12] acknowledge using this approach, but focus their efforts on explaining their acoustic model:

 $\displaystyle \mathbb{P} \left( x \lvert w \right ) = \sum_{q} \mathbb{P} \left( x \lvert q, w \right ) \mathbb{P} \left( q \lvert w \right ) \approx \max_{q} \pi(q_0) \prod_{t = 1}^T a_{q_{t-1} q_t} \prod_{t=0}^T \mathbb{P} \left( x_t \lvert q_t \right )$ (2)

Here the acoustic model is viewed as a sequence of transitions between states of tied-state triphones which [DYDA12] refer to as senones giving us the context dependent aspect of the architecture. [FLMS14] explains that senones represent the pronunciation of words and are derived by decision trees. By tying triphone states together, this approach is able to avoid having to process a large number of triphones and avoid the likely sparseness of training examples for every possible triphone.

The model assumes that there is a probability $\pi(q_0)$ for the starting state, probabilities $a_{q_{t-1} q_{t}}$ of transitioning from the state at step $t - 1$ to the state at step $t$, and finally, the probability of the acoustics given the current state $q_t$. [DYDA12] expand this last term further into:

 $\displaystyle \mathbb{P} \left( x_t \lvert q_t \right ) = \frac{\mathbb{P} \left( q_t \lvert x_t \right ) \mathbb{P} \left( x_t \right ) }{\mathbb{P} \left( q_t \right ) }$ (3)

Where $\mathbb{P} \left( q_t \lvert x_t \right )$ models the tied triphone senone posterior given mel-frequency cepstral coefficients (MFCCs) based on 11 sampled frames of audio. While MFCCs come from signal processing, they have proven to be effective features for automatic speech recognition. Based on the power spectrum derived from sample audio frames, MFCCs represent characteristics of the audio that our ears are sensitive to, as explained in [Ada10]. $\mathbb{P} \left( q_t \right )$ is the prior probability of the senone, and $\mathbb{P} \left( x_t \right )$ can be ignored since it does not vary based on the decoded word sequence we are trying to find.
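To make the "11 sampled frames" input concrete, here is a minimal sketch of how a context window of MFCC frames could be stacked into a single DNN input vector. The symmetric split (five frames on either side of the current one) and the edge padding by repetition are assumptions for illustration; the text above only states that 11 frames are used. The function name `stack_frames` is hypothetical.

```python
def stack_frames(frames, context=5):
    """Concatenate each frame with `context` neighbours on either side.

    `frames` is a list of equal-length feature vectors (e.g. MFCCs).
    Edges are padded by repeating the first or last frame, which is an
    assumption made for this sketch.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at the edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Three toy "MFCC" frames with two coefficients each.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
inputs = stack_frames(frames, context=5)
print(len(inputs), len(inputs[0]))  # 3 input vectors of 11 * 2 = 22 values
```

With real features, each input vector would have 11 times the number of coefficients per frame.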

Based on this formalism, [DYDA12] chose to use a pre-trained Deep Neural Network to estimate $\mathbb{P} \left( q_t \lvert x_t \right )$ using MFCCs as DNN inputs and taking the senone posterior probabilities as DNN outputs. The transitioning between events is best modeled by a Hidden Markov Model whose notation, $\pi, a, \text{and } q$ appears in Eq. (2). Now that we have an overview of the general CD-DNN-HMM architecture, we can look at how [DYDA12] train their model.
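The following sketch ties Eq. (2) and Eq. (3) together on a toy problem: senone posteriors (here made-up numbers standing in for DNN outputs) are divided by the senone priors to obtain scaled likelihoods, and the max over state sequences in Eq. (2) is found with the Viterbi recursion in log space. The two-state HMM, its probabilities, and the function name `viterbi` are illustrative assumptions, not values from [DYDA12].

```python
import math


def viterbi(log_pi, log_a, log_b):
    """Return (best_log_score, best_path) for the max over q in Eq. (2).

    log_pi[s]   : log start probability of state s
    log_a[r][s] : log transition probability from state r to state s
    log_b[t][s] : log P(x_t | q_t = s), up to a constant in s
    """
    n_states = len(log_pi)
    score = [log_pi[s] + log_b[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_b)):
        prev, new_score = [], []
        for s in range(n_states):
            best_r = max(range(n_states), key=lambda r: score[r] + log_a[r][s])
            prev.append(best_r)
            new_score.append(score[best_r] + log_a[best_r][s] + log_b[t][s])
        back.append(prev)
        score = new_score
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return score[last], path


# Hypothetical senone priors and per-frame posteriors P(q_t | x_t).
priors = [0.5, 0.5]
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
# Eq. (3) with P(x_t) dropped: scaled likelihood = posterior / prior.
log_b = [[math.log(p / priors[s]) for s, p in enumerate(frame)]
         for frame in posteriors]
log_pi = [math.log(0.5), math.log(0.5)]
log_a = [[math.log(0.7), math.log(0.3)],
         [math.log(0.3), math.log(0.7)]]

score, path = viterbi(log_pi, log_a, log_b)
print(path)  # [0, 0, 1]
```

Working in log space avoids numerical underflow when the products in Eq. (2) run over many frames.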

#### Pre-Training

Given the DNN model, we wish to fit the parameters of the model to a training set. This is usually accomplished by maximizing a likelihood function (equivalently, minimizing the negative log likelihood) and deploying a gradient descent procedure to update the weights. One complication to this approach is that the likelihood can be computationally expensive for multilayer networks with many nodes, rendering the approach unusable. As an alternative, one can attempt to optimize a computationally tractable surrogate to the likelihood. In this case the surrogate is the contrastive divergence method developed by [Hin02]. This sidestep enabled [HOT06] to develop an efficient unsupervised greedy pre-training process whose results can then be refined using a few iterations of the traditional supervised backpropagation approach. In this portion of the paper we discuss the work of [Hin02] and explain the greedy algorithm of [HOT06] before going on to discuss the high-level training procedure of [DYDA12].

To understand the pre-training process, it is necessary to discuss the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) models. An RBM is an undirected bipartite graphical model with Gaussian distributed input nodes in a visible layer connecting to binary nodes in a hidden layer. For simplicity, the equations below are written for the case where the visible nodes are binary as well. Every possible arrangement of hidden, $h$, and visible, $v$, nodes is given an energy under the RBM model:

 $\displaystyle E(v, h) = - b^T v - c^T h - v^T W h$ (4)

Where $W$ is the matrix of connection weights between visible and hidden nodes, and vectors $b$ and $c$ correspond to the visible and hidden biases respectively. The resulting probability is then given by:

 $\displaystyle \mathbb{P} \left( v, h \right ) = \frac{e^{-E(v, h)}}{Z}$ (5)

Where $Z$ is a normalization factor. Based on the assumptions of the RBM, [DYDA12] derive expressions for $\mathbb{P} \left( h = 1 \lvert v \right )$ and $\mathbb{P} \left( v = 1 \lvert h \right )$ given by:

 $\displaystyle \mathbb{P} \left( h = 1 \lvert v \right ) = \sigma(c + v^T W) \qquad \mathbb{P} \left( v = 1 \lvert h \right ) = \sigma(b + h^T W^T)$ (6)

Where $\sigma$ is an element-wise logistic function. [DYDA12] argue that Eq. (6) allows one to repurpose the RBM parameters to initialize a neural network. Training of the RBM is done by stochastic gradient descent against the negative log likelihood since we wish to find a stable energy configuration for the model:

 $\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model}$ (7)

However, [DYDA12] point out that the gradient of the negative log likelihood cannot be computed exactly since the $\langle \cdot \rangle_\text{model}$ term takes exponential time. As a result, the contrastive divergence method is used to approximate the derivative:

 $\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} \approx \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{1}$ (8)

where $\langle \cdot \rangle_\text{1}$ is a single step Gibbs sampled expectation. These terms are expectations of how often nodes $i \text{ and } j$ are simultaneously active under the training data and under the model. Given this approximation, regular stochastic gradient descent can be performed and the parameters of an RBM fitted to training data.
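As a rough illustration of Eqs. (4) through (8), the sketch below implements a tiny RBM with binary visible and hidden units and a single-example CD-1 update in plain Python. The toy data, layer sizes, learning rate, and function names (`cd1_update` and friends) are assumptions chosen for readability; this is not the configuration used by [DYDA12].

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def energy(v, h, W, b, c):
    """E(v, h) = -b^T v - c^T h - v^T W h, as in Eq. (4)."""
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-sum(bi * vi for bi, vi in zip(b, v))
            - sum(cj * hj for cj, hj in zip(c, h))
            - interaction)


def hidden_probs(v, W, c):
    """P(h_j = 1 | v) = sigma(c_j + sum_i v_i W_ij), as in Eq. (6)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) = sigma(b_i + sum_j W_ij h_j), as in Eq. (6)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, W, b, c, lr, rng):
    """One contrastive divergence (CD-1) step on a single example."""
    ph0 = hidden_probs(v0, W, c)
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)  # one Gibbs step
    ph1 = hidden_probs(v1, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            # <v_i h_j>_data - <v_i h_j>_1 from Eq. (8)
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])


# Toy data: two complementary 4-pixel patterns and 2 hidden units.
rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.uniform(-0.01, 0.01) for _ in range(n_hidden)]
     for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
for step in range(2000):
    cd1_update(patterns[step % 2], W, b, c, lr=0.1, rng=rng)

# Mean-field reconstruction of the first pattern after training.
reconstruction = visible_probs(hidden_probs(patterns[0], W, c), W, b)
print([round(p, 2) for p in reconstruction])
```

The update adds the difference of the two expectations because gradient descent on the negative log likelihood moves the weights along the right-hand side of Eq. (8).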

Now that we have an understanding of RBMs, we can shift our focus to DBNs. A Deep Belief Network is a multilayer model with undirected connections between the top two layers and directed connections between the other layers. To train these models, [HOT06] had the insight to treat adjacent layers of nodes as RBMs. One starts with the bottom two layers and trains them as though they were a single RBM. Once those two layers are trained, the top layer of that RBM is treated as the input layer of a new RBM, with the layer above it acting as the hidden layer of the new RBM. The sliding window over the layers continues until the full DBN is trained. After this, [HOT06] describe an “up-down” algorithm to further refine the learned weights. The learned parameters of this greedy approach can then be used as the parameters of a DNN as explained earlier in the discussion of Eq. (6).
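The sliding window over layers can be sketched structurally as follows. The RBM trainer and the upward propagation are passed in as functions; the stand-ins used in the demo (`fake_train_rbm`, `fake_propagate_up`) are hypothetical placeholders that only record layer shapes, so the example shows the control flow of the greedy procedure and not the learning itself. The “up-down” refinement of [HOT06] is not included.

```python
def greedy_pretrain(data, hidden_sizes, train_rbm, propagate_up):
    """Train a stack of RBMs one layer at a time.

    train_rbm(inputs, n_hidden) -> params for one RBM (caller-supplied)
    propagate_up(params, x)     -> hidden activations for one input x
    """
    stack = []
    layer_input = data
    for n_hidden in hidden_sizes:
        params = train_rbm(layer_input, n_hidden)
        stack.append(params)
        # The hidden layer of this RBM becomes the visible layer of the next.
        layer_input = [propagate_up(params, x) for x in layer_input]
    return stack


# Stand-in trainer: records layer shapes instead of learning anything.
def fake_train_rbm(inputs, n_hidden):
    return {"n_visible": len(inputs[0]), "n_hidden": n_hidden}


def fake_propagate_up(params, x):
    return [0.5] * params["n_hidden"]


data = [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
stack = greedy_pretrain(data, [3, 2], fake_train_rbm, fake_propagate_up)
print([(p["n_visible"], p["n_hidden"]) for p in stack])  # [(4, 3), (3, 2)]
```

In a real system the two callables would be replaced by an actual RBM trainer (for example one based on contrastive divergence) and the corresponding conditional $\mathbb{P} \left( h = 1 \lvert v \right )$.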

#### Training

Training of the CD-DNN-HMM model consists of roughly a dozen involved steps. We won’t elaborate here on the full details of each step, but will instead provide a high-level sketch of the procedure to convey its general mechanics.

The first high-level step of the procedure is to initialize the CD-DNN-HMM model. This is done by first training a decision tree to find the best tying of triphone states which are then used to train a CD-GMM-HMM system. Next, the unique tied state triphones are each assigned a unique senone identifier. This mapping will then be used to label each of the tied state triphones. (These identifiers will be used later to refine the DNN.) Finally, the trained CD-GMM-HMM is converted into a CD-DNN-HMM by retaining the triphone and senone structure and HMM parameters. This resulting DNN goes through the previously discussed pre-training procedure.

The next high-level step iteratively refines the CD-DNN-HMM. To do this, first the originally trained CD-GMM-HMM model is used to generate a raw alignment of states, which is then mapped to the corresponding senone identifiers. This resulting alignment is then used to refine the DBN pre-trained DNN by backpropagation. Next, the prior senone probability is estimated based on the number of frames paired with the senone and the total number of frames. These estimates are then used to refine the HMM transition probabilities to maximize the likelihood of the features. Finally, if these newly estimated parameters do not improve accuracy against a development set, then the training procedure terminates; otherwise, the procedure repeats this high-level step.
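The prior estimation step is simple enough to show directly: the prior of a senone is the number of frames aligned to it divided by the total number of frames. The alignment below is made-up data and the function name `senone_priors` is hypothetical.

```python
from collections import Counter


def senone_priors(alignment, n_senones):
    """Estimate P(q) as frames aligned to q divided by total frames."""
    counts = Counter(alignment)
    total = len(alignment)
    return [counts.get(s, 0) / total for s in range(n_senones)]


# One senone identifier per frame, as produced by a forced alignment.
alignment = [0, 0, 1, 2, 2, 2, 1, 0]
priors = senone_priors(alignment, 3)
print(priors)  # [0.375, 0.25, 0.375]
```

A senone that never appears in the alignment would get a prior of zero, which cannot be used as the denominator in Eq. (3); how to handle that case is left open here.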

### Experimental Results

#### System Configurations

[DYDA12] report that their system relies on a nationwide language model consisting of 1.5 million trigrams. For their acoustic model, they use a five hidden layer DNN with each layer containing 2,048 hidden units. Training the system from scratch on 24 hours of training data takes four days on a Dell T3500 workstation with an NVIDIA Tesla GPU. [DYDA12] emphasize the importance of the GPU in obtaining acceptable training time, and that without it, training time would be 30x slower.

#### Datasets and Metrics

Comparison of automatic speech recognition systems relies on three principal error metrics: sentence (SER), word (WER), and phoneme (PER) error rates. These look at the ratio of incorrect entities to the total number of entities, with the exception of word error rate, which uses a Levenshtein approach to measure the number of insertions, substitutions, and deletions relative to the total number of words. A sentence is considered incorrect if there is at least one incorrect word.
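A minimal sketch of the word and sentence error rates described above, using a standard dynamic-programming Levenshtein distance over words. The example sentences are invented, and whitespace tokenization is an assumption made for brevity.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def sentence_error_rate(references, hypotheses):
    # A sentence counts as wrong if any word differs.
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / len(references)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333: one substitution and one deletion over six words
ser = sentence_error_rate(["a b", "c d"], ["a b", "c e"])
print(ser)  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.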

These error metrics often coincide with different datasets; in particular, WER is reported for Switchboard, SER for Bing Mobile Voice Search (BMVS), and PER for TIMIT. Switchboard is a collection of phone conversations between two people, BMVS is a collection of short spoken queries such as “The Med” or “Chautauqua Park” that are used to find these locations, and TIMIT is a phonetically focused corpus of phonetically rich spoken sentences.

#### Results

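| Model | Switchboard (WER) | BMVS (SER) | TIMIT (PER) |
|-------|-------------------|------------|-------------|
| GMM | 23.6[2] | 36.2[1] | 21.7[2] |
| DNN | 16.1[2] | 30.4[1] | 21.9[3] |
| CNN | – | – | 20.2[3] |
| RNN | – | – | 17.7[4] |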

Direct comparison of models is complicated by the variety of error metrics and datasets; [DBL12] is used to fill in these gaps to make a meaningful comparison. As one can see from Table (1), the neural network approaches do better on average than the traditional GMM approach. To illustrate that it is not only DNN approaches that do better, the work of [AMJ+14] using a Convolutional Neural Network (CNN) and [GMH13] using a Recurrent Neural Network (RNN) are included to drive home the point that neural network architectures are viable alternatives to GMMs.

### Conclusions

[DYDA12], [AMJ+14], and [GMH13] have shown that neural network architectures exhibit better performance than Gaussian Mixture Models. [DYDA12] believe that a more capable first layer model provided by mean-covariance restricted Boltzmann machines will increase performance, while [AMJ+14] plan to investigate unexpected improvements in large-vocabulary speech recognition where they were absent in phone recognition tasks when using convolutional restricted Boltzmann machines. Both routes seem promising and are likely to produce improved error rates in line with [GMH13]’s results.

In [DBL12], the authors, who represent four research groups, suggest key gains will come from improved understanding of the pre-training process and how the types of units used in these models affect error rates. They conclude that distributed training is the largest hurdle to overcome for these systems to make use of more training data. (Parallelization is limited by the sequential stochastic gradient descent at the heart of the pre-training and training processes.) As [DYDA12] point out in their paper, GPU-based approaches can assist in reducing computation time, but more foundational approaches need to be pursued.

In a 2014 talk [Hin14], Hinton criticizes existing neural network architectures on philosophical grounds arguing that they do not correspond well enough to how the brain functions citing inadequate structural complexity. His proposed solution is a new neural network approach that clusters neurons together into capsules, which he believes will better model how the cortical columns of the brain behave. If Hinton is right (which his track record suggests), then it is likely we’ll see this capsule approach outperform existing models, and consequently, yield improved error rates in automatic speech recognition.

### References

[Ada10] Andre Gustavo Adami. Automatic speech recognition: From the beginning to the Portuguese language. In The Int. Conf. on Computational Processing of Portuguese (PROPOR). Rio Grande do Sul: Porto Alegre, 2010.

[AMJ+14] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533-1545, 2014.

[DBL12] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97, 2012.

[DYDA12] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30-42, 2012.

[FLMS14] Luciana Ferrer, Yun Lei, Mitchell McLaren, and Nicolas Scheffer. Spoken language recognition based on senone posteriors. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 2150-2154. ISCA, 2014.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645-6649, 2013.

[Hin02] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[Hin14] Geoffrey E. Hinton. What’s wrong with convolutional nets? Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Fall Colloquium Series, 2014.

[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.

Written by lewellen

2015-06-01 at 8:00 am

## An Experiment in Optical Character Recognition

### Introduction

I’ve always been interested in machine learning and how it can be applied to a number of different problems. I spent some time during college learning some of the techniques used in machine learning, but since then I’ve gotten a bit rusty. So, I revisited the subject by doing some additional research and decided to try my hand at Optical Character Recognition (OCR): the process of identifying characters within an image and producing a text output. There are a handful of traditional aspects to this process: image acquisition, segmentation, recognition and correction. Acquisition is about correcting an input image so that a segmentation process can be readily applied; segmentation identifies the characters in the image; recognition maps those visual characters to text characters; finally, correction takes the text output and rectifies any errors. The following outlines my approach to segmentation and recognition.

### Segmentation

Consider the following body of text from one of my earlier posts:

The goal is to extract each of the characters in the image. The simplest way to accomplish this is to implement an algorithm that reads the image much in the same way one might read a block of text: start at the top of the text and scan downward identifying all of the rows of text, then for each row, read all the characters from left to right, then identify words based on white space. Figuring out the word boundaries is done via a simple clustering process. Assuming we have an ordered set of rectangles produced by the first two steps, we can calculate the average distance between consecutive rectangles. Once this average has been computed, we iterate over the list once more, adding rectangles to the current word when the distance between consecutive rectangles is less than the average distance and starting a new word when the average is exceeded.
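The word clustering step can be sketched in a few lines. This is an illustration in Python, not the code used in the project; rectangles are reduced to hypothetical `(left, right)` x-extents, and a gap exactly equal to the average is treated as a word break, which is an arbitrary choice.

```python
def group_into_words(rects):
    """Group character rectangles into words by the average gap.

    rects: list of (left, right) x-extents, ordered left to right.
    """
    if not rects:
        return []
    gaps = [b[0] - a[1] for a, b in zip(rects, rects[1:])]
    average = sum(gaps) / len(gaps) if gaps else 0.0
    words = [[rects[0]]]
    for rect, gap in zip(rects[1:], gaps):
        if gap < average:
            words[-1].append(rect)   # same word
        else:
            words.append([rect])     # gap too large: start a new word
    return words


# Five characters: three close together, a wide space, then two more.
rects = [(0, 5), (6, 11), (12, 17), (25, 30), (31, 36)]
words = group_into_words(rects)
print([len(w) for w in words])  # [3, 2]
```

Note that a line containing a single word with uniform spacing would be split at every character under this rule, since no gap is below the average.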

This segmentation approach isn’t perfect as it assumes that we are dealing with evenly spaced characters of the same size from the same font. Of course, this isn’t always the case and we have things like kerning and ligatures to deal with. In this example two categories of problems arise: character combinations such as ay, ey and ly that are separable, and combinations such as yw, rt and ct that are not separable. Both end up being interpreted as a single character using this method. For the first category, I chose to divide rectangles whenever a line of characters has a single black pixel that does not have a black pixel neighboring it to the left, that is, at any of ((x-1, y + a) | a <- [-1, 1]). The second case isn't as clear cut, as all the pixels neighbor one another. In principle one could implement a k-means clustering algorithm, but that assumes you know how many characters (k) you have in the image. I decided to allow the error to propagate through the system.

### Recognition

Starting out, I knew that I wanted to use an artificial neural network (ANN), so I spent some time reading Russell and Norvig’s “Artificial Intelligence: A Modern Approach”, but felt that it wasn’t as comprehensive as I wanted, so I also read MacKay’s “Information Theory, Inference and Learning Algorithms,” which was more in tune with what I had in mind. I also came across a series (1, 2, 3) of pdf files hosted at Aberdeen’s Robert Gordon University that provided a more practical view of applying neural networks to pattern recognition.

A little background on ANNs: The general idea is that an individual neuron aggregates the weighted inputs from other neurons then outputs a signal. The magnitude of the signal is determined as a function of the aggregation called the activation function. Neurons are organized into layers, typically an input layer, one or more hidden layers and finally an output layer. Values from the input layer are fed into the hidden layer, then those outputs feed into the next and so on until all the values have gone through the output layer. The process of getting the network to map an input to an output is accomplished through training, in this case, a method known as backpropagation. Given an input and an expected output, the input is fed through the network and produces an output. The difference between the two output vectors is the error that then needs to be fed backward through the network, updating each node’s input weights such that the net error of the system is reduced. The method is effectively a gradient descent algorithm. I recommend reading the aforementioned references for details on how it all works. Following is my implementation of the backpropagation algorithm:

```csharp
using System;
using System.Linq;
using Library.Mathematics;

namespace OCRProject.Recognizers.NeuralNetworks {
    // A single unit: applies the activation function to the weighted sum of its inputs.
    public class Neuron {
        Func<double, double> activationFunction;

        public Vector InputWeights { get; set; }

        public Neuron(Func<double, double> activationFunction) {
            this.activationFunction = activationFunction;
        }

        public double Output(Vector input) {
            return activationFunction(InputWeights.dot(input));
        }
    }

    // Feed-forward network with one hidden layer, trained by backpropagation.
    public class NeuralNetwork {
        private Neuron[] hiddenLayer, outputLayer;

        ...

        // Forward pass: input -> hidden layer -> output layer.
        public Vector Output(Vector input) {
            Vector hiddenLayerOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            return new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hiddenLayerOutput));
        }

        // One backpropagation step for a single (input, desiredOutput) pair.
        // The error terms assume a logistic activation, whose derivative is y * (1 - y).
        public Vector Train(Vector input, Vector desiredOutput, double learningRate) {
            // Forward pass, keeping the hidden activations for the weight updates.
            Vector hOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            Vector oOutput = new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hOutput));

            // Output-layer error terms: activation derivative times the output error.
            Vector oError = new Vector(
                oOutput.Dimension,
                (i) => oOutput[i] * (1 - oOutput[i]) * (desiredOutput[i] - oOutput[i])
            );

            // Hidden-layer error terms: output errors propagated back through the weights
            // from hidden unit i. Computed before the output weights are updated so that
            // the weights from the forward pass are used.
            Vector hError = new Vector(
                hiddenLayer.Length,
                (i) => hOutput[i] * (1 - hOutput[i]) * (oError.dot(new Vector(oError.Dimension, (j) => outputLayer[j].InputWeights[i])))
            );

            // Update the output-layer weights.
            for (int n = 0; n < outputLayer.Length; n++) {
                outputLayer[n].InputWeights = new Vector(
                    hiddenLayer.Length,
                    (i) => outputLayer[n].InputWeights[i] + learningRate * oError[n] * hOutput[i]
                );
            }

            // Update the hidden-layer weights.
            for (int n = 0; n < hiddenLayer.Length; n++) {
                hiddenLayer[n].InputWeights = new Vector(
                    input.Dimension,
                    (i) => hiddenLayer[n].InputWeights[i] + learningRate * hError[n] * input[i]
                );
            }

            return oError;
        }
    }
}
```


In terms of how all of this applies to OCR, I started out with all visible characters and produced a collection of 16×16 images. Each image contained a single character centered in the image. This image was then mapped to a 256 dimensional vector with its corresponding character mapped to an 8 dimensional vector representing the expected output. The question that remains is how many hidden layer units should be used. To figure this out, I conducted a small experiment:

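| Hidden Units | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | **11** | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum (%) | 0 | 4 | 7 | 21 | 36 | 48 | 69 | 80 | 87 | 85 | 92 | 88 | 92 | 89 | 93 | 93 |
| Average (%) | 1 | 4 | 12 | 28 | 44 | 56 | 72 | 83 | 89 | 90 | 94 | 93 | 94 | 93 | 95 | 95 |
| Maximum (%) | 1 | 6 | 15 | 31 | 51 | 62 | 75 | 87 | 91 | 93 | 99 | 96 | 98 | 95 | 98 | 99 |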

Each run consisted of 10 trials with each trial taking 5000 iterations to train the network. (Normally, one can measure the Mean squared error of the network and halt the training process once it is sufficiently small.) After doing this testing, I found that ANNs with 11 hidden units looked to give the highest accuracy with the fewest number of units. Given the original text that was used, the following text was produced:

As expected, the category of errors that were identified earlier (character combinations ff, rt, ct) were not segmented correctly and consequently not recognized correctly.
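For readers wondering what the 256-dimensional inputs and 8-dimensional targets described earlier might look like, here is one possible encoding, sketched in Python. It assumes the 8 output values are the bits of the character's code (most significant bit first) and that outputs are thresholded at 0.5 when decoding; the post does not spell out the actual encoding, so treat both choices and all function names as hypothetical.

```python
def image_to_input(pixels):
    """Flatten a 16x16 grid of pixel intensities into a 256-element list."""
    return [value for row in pixels for value in row]


def char_to_target(ch):
    """Encode a character as 8 bits of its code, most significant first."""
    code = ord(ch)
    if code > 255:
        raise ValueError("character does not fit in 8 bits")
    return [float((code >> bit) & 1) for bit in range(7, -1, -1)]


def target_to_char(output, threshold=0.5):
    """Decode a network output by thresholding each of the 8 values."""
    code = 0
    for value in output:
        code = (code << 1) | (1 if value >= threshold else 0)
    return chr(code)


blank = [[0.0] * 16 for _ in range(16)]
print(len(image_to_input(blank)))  # 256
print(char_to_target("A"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(target_to_char([0.1, 0.9, 0.2, 0.0, 0.1, 0.3, 0.2, 0.8]))  # A
```

With a bit encoding like this, a single wrong output bit yields a different character, which is one reason accuracy can vary so much with the number of hidden units.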

### Wrap-up

There is a lot of additional work that could be thrown at the project. In the future, it’d be good to modify the solution to accept any font at any size, to add support for images containing scanned or photographed text rather than computer-generated images, and to add some error correction on the output to flag parts of the text that require review. I suspect that I will continue down this road and start investigating how these methods can be applied to general computer vision for some planned robotics projects.

Written by lewellen

2010-06-01 at 8:00 am