5. Language models

So far we have been concerned with classification problems where the input is text and the output is a categorical label. Starting from this week, we will consider problems where the output is a sequence of symbols. Many NLP tasks fall in this category.

Before going into sequence prediction, let’s consider the problem of density estimation, i.e. assigning a probability to a sequence of words. Why do we care about this problem? For applications where the output is a sentence, e.g. speech recognition and machine translation, we want to measure how fluent the output is; in other words, how likely it is that the sentence was produced by a native speaker of the language.

A language model assigns a probability to any sequence of words. Let \(x_1, \ldots, x_n\) be a sequence of \(n\) tokens. How should we model \(p(x_1, \ldots, x_n)\)? We have already encountered a similar problem in text classification. If we assume that the \(x_i\)’s are independent, as in Naive Bayes models, then we have \(p(x_1, \ldots, x_n) = \prod_{i=1}^np(x_i)\). Can we do better? By the chain rule of probability, we have

(5.1)\[p(x_1, \ldots, x_n) = p(x_1) p(x_2\mid x_1) \cdots p(x_n\mid x_1, \ldots, x_{n-1}) \;.\]

Now, we can model each conditional probability with a context of \(m\) words by a categorical distribution. The MLE estimate of, say, \(p(\text{jumped} \mid \text{the brown fox})\) is simply the count of “the brown fox jumped” in our corpus divided by the count of the trigram “the brown fox”. The problem is that this model needs a huge number of parameters, because the number of contexts grows exponentially with the context size: with a vocabulary \(V\) there are \(|V|^m\) possible contexts of \(m\) words. Due to the sparsity of language, we are unlikely to get enough data to estimate these parameters.

5.1. N-gram language models

To simplify the model above, let’s use the Markov assumption: a token only depends on \(k\) previous tokens:

(5.1.1)\[p(x_1, \ldots, x_n) \approx \prod_{i=1}^n p(x_i\mid x_{i-k}, \ldots x_{i-1}) \;.\]

Note that the Naive Bayes assumption corresponds to a unigram language model here. In general, an \(n\)-gram model assumes that each token depends on the previous \(n-1\) tokens, i.e. \(k = n-1\).

The MLE estimates of the conditional probabilities are:

(5.1.2)\[p_{\text{MLE}}(x_i\mid x_{i-k}, \ldots x_{i-1}) = \frac{\text{count}(x_{i-k}, \ldots, x_{i-1}, x_i)}{\sum_{w\in V}\text{count}(x_{i-k}, \ldots, x_{i-1}, w)} = \frac{\text{count}(x_{i-k}, \ldots, x_{i-1}, x_i)}{\text{count}(x_{i-k}, \ldots, x_{i-1})} \;.\]

In words, the probability of a token following some context is simply the fraction of times we see that token following the context out of all tokens following the context in our corpus. Check out the ngram counts from Google Books.
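As a minimal sketch of (5.1.2), the following counts n-grams in a toy corpus and divides by the context count. The function name `ngram_mle` and the toy sentence are ours, chosen for illustration; returning 0 for an unseen context is also an assumption, since the MLE is undefined there.

```python
from collections import Counter

def ngram_mle(tokens, n):
    """Return a function p(word, context) giving the MLE n-gram probability."""
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Context counts are the n-gram counts summed over the last word,
    # matching the first denominator in (5.1.2).
    context_counts = Counter()
    for ngram, c in ngram_counts.items():
        context_counts[ngram[:-1]] += c

    def prob(word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # unseen context: MLE undefined, we return 0 (assumption)
        return ngram_counts[context + (word,)] / context_counts[context]

    return prob

corpus = "the brown fox jumped over the brown dog".split()
p = ngram_mle(corpus, n=3)
print(p("fox", ("the", "brown")))  # 0.5
print(p("cat", ("the", "brown")))  # 0.0
```

Here “the brown” occurs twice as a context, once followed by “fox” and once by “dog”, so each gets probability 0.5 and every other word gets zero.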

Exercise. Derive the MLE estimate for n-gram language models. [Hint: Note that given a context, the conditional probabilities need to sum to one. You can use Lagrange multipliers to solve the constrained optimization problem.]

5.1.1. Backoff and interpolation

In practice, what context size should we use? Larger \(n\) helps us capture long-range dependency in language, but small \(n\) allows for more accurate estimation.

Backoff. One simple idea is to use a larger \(n\) when we have more “evidence”. For example, use the largest \(n\) for which we have more than \(\alpha\) counts; the smallest \(n\) we use is 1, i.e. a unigram model.

Interpolation. A better idea is to interpolate probabilities estimated from different n-grams instead of committing to one. For example,

(5.1.3)\[p(x_i\mid x_{i-1}, x_{i-2}) = \lambda_1 p(x_i) + \lambda_2 p(x_i\mid x_{i-1}) + \lambda_3 p(x_i\mid x_{i-1}, x_{i-2}) \;.\]

We can choose the \(\lambda_i\)’s, which must be non-negative and sum to one, using cross validation.
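The interpolation in (5.1.3) can be sketched as follows. The helper names `mle` and `interpolated`, the toy corpus, and the particular weights are illustrative assumptions; in practice the weights would be tuned on held-out data. For simplicity, a component whose context was never seen contributes 0 here.

```python
from collections import Counter

def mle(tokens, n):
    """MLE n-gram probabilities as a function p(word, context_tuple)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter()
    for g, c in grams.items():
        contexts[g[:-1]] += c

    def p(word, context=()):
        total = contexts[tuple(context)]
        return grams[tuple(context) + (word,)] / total if total else 0.0

    return p

def interpolated(tokens, lambdas):
    """Mix unigram, bigram and trigram MLE estimates as in (5.1.3)."""
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
    p1, p2, p3 = mle(tokens, 1), mle(tokens, 2), mle(tokens, 3)

    def p(word, prev2, prev1):
        # prev2 and prev1 play the roles of x_{i-2} and x_{i-1}.
        return l1 * p1(word) + l2 * p2(word, (prev1,)) + l3 * p3(word, (prev2, prev1))

    return p

corpus = "the brown fox jumped over the brown dog".split()
p = interpolated(corpus, lambdas=(0.2, 0.3, 0.5))
print(p("fox", "the", "brown"))  # seen trigram: all three terms contribute
print(p("dog", "over", "the"))   # unseen trigram and bigram: only the unigram term
```

The second query shows the point of interpolation: “over the dog” never occurs, so the trigram and bigram terms are zero, but the unigram term keeps the probability positive.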

5.1.2. Smoothing

Note that the above model would assign zero probability to any sequence containing a word that never occurs in the training corpus, which is obviously undesirable. So we would like to assign probabilities to unseen words as well.

One such technique we have already seen is Laplace smoothing, where we add one to each count. It works well for text classification, where we only considered unigram probabilities. However, for longer contexts, many tokens are unlikely to occur. For example, given the context “I have just”, only some words (e.g. verbs) are likely to follow; even if we had access to infinite data, the probabilities of certain words would be close to zero. Thus Laplace smoothing would assign too much probability mass to unseen words when the true occurrences are sparse. One quick fix is to use a pseudocount of \(\alpha\) where \(\alpha \lt 1\), but the optimal value is data dependent.
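To see how much mass add-\(\alpha\) smoothing hands to unseen continuations, here is a small sketch for a bigram model. The function `add_alpha_bigram` and the toy corpus are hypothetical, made up for this illustration.

```python
from collections import Counter

def add_alpha_bigram(tokens, alpha):
    """Bigram model with a pseudocount of alpha added to every bigram count."""
    vocab = sorted(set(tokens))
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])

    def p(word, prev):
        return (bigrams[(prev, word)] + alpha) / (context[prev] + alpha * len(vocab))

    def unseen_mass(prev):
        # Total probability given to words never observed after `prev`.
        return sum(p(w, prev) for w in vocab if bigrams[(prev, w)] == 0)

    return p, unseen_mass, vocab

tokens = "i have just eaten i have just left i have just eaten".split()
for alpha in (1.0, 0.1):
    p, unseen_mass, vocab = add_alpha_bigram(tokens, alpha)
    print(alpha, round(unseen_mass("just"), 3))
# prints "1.0 0.375", then "0.1 0.086"
```

With \(\alpha = 1\), more than a third of the probability after “just” goes to words that never followed it, even though only two of the five vocabulary words ever did; a smaller \(\alpha\) reduces this mass.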

Next, let’s consider a better solution. Instead of allocating a fixed amount of probability mass to the unseen words, we can estimate the probability of an unseen word by the probability of words that we’ve seen once, assuming that the frequencies of these words are similar. For example, suppose you have arrived on a new planet and want to estimate the probability of different species on this planet. So far, you have observed 3 tats, 4 gloins, 1 dido, and 1 bity. What is the chance that the next animal you encounter will be of an unseen species? We can estimate it by the probability of dido and bity, which have occurred once, i.e. \((1+1)/(3+4+1+1)=2/9\).

Good-Turing smoothing

Let’s consider the problem of estimating the count of species (words in the case of language modeling), given observations of a subset of these species. Let \(N_r\) be the number of species that have occurred \(r\) times. For example,

Species counted in \(N_1\) (seen once): dido, bity, …
Species counted in \(N_2\) (seen twice): tike, wab, …

We simulate the scenario with unseen species by cross validation, i.e. we set aside a subset of objects as the held-out set. Now, consider leave-one-out cross validation: for a dataset of size \(M\), we run \(M\) experiments, where each time exactly one object is taken as the held-out set and the remaining \(M-1\) objects form the training set. In the \(M\) held-out sets, how many objects never occur in their corresponding training set? Note that once we move an object whose species occurs only once to the held-out set, the count of that species in the training set becomes zero. Thus the fraction of held-out objects that never occur in the training set is \(\frac{N_1}{M}\). Similarly, the fraction of held-out objects that occur \(k\) times in the training set is \(\frac{(k+1)N_{k+1}}{M}\), since these are exactly the objects whose species occurs \(k+1\) times in the full dataset. Therefore, we estimate the probability of unseen objects by

(5.1.4)\[p_0 = \frac{N_1}{M} \;,\]

and the probability of objects that occur \(k\) times in the training set by

(5.1.5)\[p_k = \frac{(k+1)N_{k+1}}{MN_{k}} \;.\]

We divide by \(N_k\) when computing \(p_k\) because it could be any one of the \(N_k\) species. Comparing \(p_k\) with the MLE estimate \(\frac{k}{M}\), we see that Good-Turing smoothing uses an adjusted count \(c_k = \frac{(k+1)N_{k+1}}{N_k}\). For text, usually we have \(N_k \gt N_{k+1}\) and \(c_k \lt k\), i.e. the counts of observed words are discounted and some probability mass is allocated to the unseen words.
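The estimates (5.1.4) and the adjusted counts \(c_k\) can be computed directly from the counts in the planet example above. This is a minimal sketch; the function name `good_turing` is ours, and exact fractions are used only to make the numbers easy to read.

```python
from collections import Counter
from fractions import Fraction

def good_turing(species_counts):
    """Return p_0 and the adjusted counts c_k = (k+1) N_{k+1} / N_k."""
    M = sum(species_counts.values())        # total number of observations
    N = Counter(species_counts.values())    # N[r] = number of species seen r times
    p0 = Fraction(N[1], M)
    adjusted = {k: Fraction((k + 1) * N[k + 1], N[k]) for k in sorted(N)}
    return p0, adjusted

counts = {"tats": 3, "gloins": 4, "dido": 1, "bity": 1}
p0, adjusted = good_turing(counts)
print(p0)        # 2/9
print(adjusted)  # adjusted counts for k = 1, 3, 4
```

On such a tiny sample most adjusted counts come out as zero because \(N_{k+1} = 0\) (here no species was seen twice or five times). This illustrates why the raw formula needs many observations, or some smoothing of the \(N_r\) values, to be useful.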

In practice, Good-Turing estimation is not used directly for n-gram language models because it doesn’t combine higher-order models with lower-order ones, but the idea of using discounted counts is used in other smoothing techniques.

Kneser-Ney smoothing

Let’s now turn to Kneser-Ney smoothing, which is widely used for n-gram language models. There are two key ideas in Kneser-Ney smoothing.

First, use absolute discounting. In Good-Turing smoothing, we use a discounted count for each n-gram. It turns out that in practice the discount, i.e. the difference between the raw count and the Good-Turing count, is often close to 0.75. So instead of computing the Good-Turing counts, let’s just subtract 0.75 or some other constant. Taking a bigram language model as an example, we have

(5.1.6)\[p(x_i\mid x_{i-1}) = \frac{\max(\text{count}(x_{i-1}, x_i) - \delta, 0)}{\text{count}(x_{i-1})} \;,\]

where \(\delta\) is the discount.

Second, consider versatility when interpolating with lower-order models. Note that in interpolation, lower-order models are crucial only when the higher-order context is rare in our training set. As a motivating example, consider the bigram “San Francisco”. Suppose “San Francisco” is a frequent phrase in our corpus. Then the word “Francisco” will have a high unigram probability, even though it almost always occurs after “San”. In an interpolated model, if “San” is in the context, then the bigram model should provide a good estimate; if “San” is not in the context and we back off to the unigram model, the MLE estimate would assign a large probability to “Francisco”, which is undesirable. Therefore, instead of using the unigram probability of “Francisco”, we compute the fraction of contexts followed by it: \(\frac{\text{number of bigram types ending with "Francisco"}}{\text{total number of bigram types}}\). This can be considered the versatility of the word, as it measures how many distinct contexts the word can follow (normalized by the total number of bigram types). More generally, we have

(5.1.7)\[\beta(x_i) = \frac{|\{x\in V\colon \text{count}(x, x_i) > 0\}|} {|\{(x, x')\in V\times V\colon \text{count}(x, x') > 0\}|} \;.\]

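Note that (a) \(\beta(x_i)\) lies between 0 and 1 and sums to one over \(x_i\in V\) (every bigram type ends with exactly one word), but it is not an estimate of how often \(x_i\) occurs; (b) we count types, i.e. unique n-grams, rather than using token counts as in the MLE estimate.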

Now, putting absolute discounting and versatility together, we have

(5.1.8)\[p_{\text{KN}}(x_i\mid x_{i-1}) = \frac{\max(\text{count}(x_{i-1}, x_i) - \delta, 0)}{\text{count}(x_{i-1})} + \lambda(x_{i-1})\beta(x_i) \;,\]

where \(\lambda(x_{i-1})\) is a normalization constant to make sure that \(\sum_{x\in V} p_{\text{KN}}(x\mid x_{i-1}) = 1\).

Exercise: Show that \(\lambda\) depends on the context.

For higher-order models, we can define it recursively as

(5.1.9)\[p_{\text{KN}}(x_i\mid x_{i-k}, \ldots, x_{i-1}) = \frac{\max(\text{count}(x_{i-k}, \ldots, x_i) - \delta, 0)}{\text{count}(x_{i-k}, \ldots, x_{i-1})} + \lambda(x_{i-k}, \ldots, x_{i-1}) p_{\text{KN}}(x_i\mid x_{i-k+1}, \ldots, x_{i-1}) \;.\]
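Here is a minimal sketch of the bigram case (5.1.8) on a toy corpus. The function `kneser_ney_bigram` and the corpus are hypothetical. The sketch assumes every queried context word was seen as a context in training, and it obtains \(\lambda(x_{i-1})\) numerically as the mass left over after discounting rather than from a closed form.

```python
from collections import Counter

def kneser_ney_bigram(tokens, delta=0.75):
    vocab = sorted(set(tokens))
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])                    # count(x_{i-1}) as a context
    # beta(w): fraction of distinct bigram types that end with w, as in (5.1.7).
    preceders = Counter(w for (_, w) in bigrams)
    beta = {w: preceders[w] / len(bigrams) for w in vocab}

    def discounted(word, prev):
        return max(bigrams[(prev, word)] - delta, 0.0) / context[prev]

    def lam(prev):
        # Leftover mass after discounting; because beta sums to one over the
        # vocabulary, this makes p_KN(. | prev) sum to one.
        return 1.0 - sum(discounted(w, prev) for w in vocab)

    def p(word, prev):
        return discounted(word, prev) + lam(prev) * beta[word]

    return p, beta, vocab

tokens = ("i love san francisco . san francisco is foggy . "
          "we left san francisco . i love tea . we drink tea .").split()
p, beta, vocab = kneser_ney_bigram(tokens)
print(tokens.count("francisco"), tokens.count("tea"))  # 3 2
print(beta["francisco"] < beta["tea"])                 # True
```

“francisco” is more frequent than “tea” but follows only one distinct word, so its versatility is lower: after an unrelated context, the model prefers “tea” over “francisco”.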

5.2. Neural language models

5.2.1. Language modeling as a classification task

In n-gram language models, we model \(p(x_i\mid x_{i-k}, \ldots, x_{i-1})\) by a multinomial distribution. If we consider \(x_i\) as the label of the input context \(x_{i-k}, \ldots, x_{i-1}\), this becomes a text classification problem and we can use logistic regression:

(5.2.1)\[p(x_i\mid x_{i-k}, \ldots, x_{i-1}) = \frac{\exp\left [ w_{x_i}\cdot\phi(x_{i-k}, \ldots, x_{i-1}) \right ]} {\sum_{v\in V}\exp\left [ w_v\cdot\phi(x_{i-k}, \ldots, x_{i-1}) \right ]} \;,\]

where \(w_v\) is the weight vector for word \(v\) in the vocabulary \(V\).

The main task now is to design the feature extractor \(\phi\).

Exercise: Design a feature extractor. What would be useful features for predicting the next word?

5.2.2. Feed-forward neural networks

Before neural networks dominated NLP, a lot of the effort in building NLP models went into feature engineering, which is basically the exercise you went through above. The key idea in neural networks is to learn these features directly instead of designing them manually. We have already seen something similar in learned word embeddings: we don’t necessarily know what each dimension means, but we know that it represents some useful information, for example for predicting words in the context.

Now, consider a general binary classification task with raw inputs \(x=[x_1, \ldots, x_p]\). Instead of specifying features \(\phi_1(x), \phi_2(x), \ldots\), let’s learn \(k\) intermediate features \(h(x) = [h_1(x), \ldots, h_k(x)]\). In neural networks, the \(h_i\)’s are called hidden units. Then we can make predictions based on these features using the score \(w^Th(x)\). How should we parameterize the \(h_i\)’s then? One option is to use a linear function that we’re already pretty familiar with: \(h_i(x) = w_i^Tx\). However, the composition of linear functions is still linear, so we would not gain anything from learning these intermediate features.
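To make the last point concrete, here is the one-line calculation. We assume, for illustration only, that the \(w_i\)’s are stacked as the rows of a matrix \(W\in\mathbb{R}^{k\times p}\) (this matrix notation is ours), so that \(h(x) = Wx\):

\[w^Th(x) = w^T(Wx) = (W^Tw)^Tx = \tilde{w}^Tx \;, \quad \text{where } \tilde{w} = W^Tw \in \mathbb{R}^p \;.\]

So the score is still a linear function of the raw input \(x\), with a single weight vector \(\tilde{w}\).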

The power of neural networks comes from their non-linear activation functions:

(5.2.2)\[h_i(x) = \sigma(w_i^Tx) \;,\]

where \(\sigma\) is usually a non-linear, differentiable (for SGD) function. Here are some common activation functions:

%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

# Plot three common activation functions on the same axes.
x = np.arange(-4, 4, 0.01)
plt.plot(x, np.tanh(x), label='tanh')
plt.plot(x, 1/(1 + np.exp(-x)), label='sigmoid')
plt.plot(x, np.fmax(x, 0), label='ReLU')
plt.ylim(-1, 1)  # ReLU is unbounded above, so it is cut off at 1 here
plt.legend()

Now let’s replace our n-gram model with a feed-forward neural network. The input is the \(k\) words in the context. Each word is mapped to a dense vector, and these vectors are concatenated to form a single vector representing the context. The last layer is a softmax (multiclass logistic) function that predicts the next word.
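The forward pass just described can be sketched in NumPy as follows. All names, sizes, the single tanh hidden layer, and the randomly initialized (untrained) weights are assumptions for illustration; the architecture in the figure may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "brown", "fox", "jumped", "over"]
V, d, k, H = len(vocab), 4, 2, 8   # vocab size, embedding dim, context size, hidden units

E = rng.normal(scale=0.1, size=(V, d))        # embedding table, one row per word
W1 = rng.normal(scale=0.1, size=(k * d, H))   # concatenated context -> hidden
b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))       # hidden -> one score per word
b2 = np.zeros(V)

def next_word_distribution(context_ids):
    e = E[context_ids].reshape(-1)            # look up and concatenate k embeddings
    h = np.tanh(e @ W1 + b1)                  # hidden units
    scores = h @ W2 + b2
    scores = scores - scores.max()            # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                # softmax over the vocabulary

context = [vocab.index("the"), vocab.index("brown")]
probs = next_word_distribution(context)
print(dict(zip(vocab, probs.round(3))))
```

Since the weights are random, the output is close to uniform; training would adjust `E`, `W1`, and `W2` so that likely next words receive higher probability.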


Fig. 5.2.1 Feed-forward language model

Exercise: How can we use a BoW representation in feed-forward neural networks? What are the advantages and disadvantages?

Backpropagation

A neural network is just like any other model we have seen, so we can learn its parameters by minimizing the average loss using SGD. The main challenge here is that the objective function is now non-convex, which means that SGD may only lead us to a local optimum. However, in practice, SGD has been found to be quite effective for learning neural models.

Here our loss function is negative log-likelihood. Let’s compute the partial derivative w.r.t. \(W_{21}[ij]\) using the chain rule.

(5.2.3)\[\frac{\partial \ell}{\partial W_{21}[ij]} = \frac{\partial \ell}{\partial s_1[j]} \frac{\partial s_1[j]}{\partial W_{21}[ij]} \;,\]

where \(s_1[j] = W_{21}[\cdot j]^T e\). As an exercise, try to compute \(\frac{\partial \ell}{\partial W_{11}[ij]}\). You will see that it depends on \(\frac{\partial \ell}{\partial W_{21}[ij]}\), which means that we can reuse previous results if we choose to compute the partial derivatives in a specific order!

In general, we can think of the function to be optimized as a computation graph, where the nodes are intermediate results (e.g. \(s_1\)) and the edges are the mappings from input nodes to output nodes. Backpropagation computes the partial derivatives in a specific order to save computation (think dynamic programming). It can be done automatically using modern frameworks such as TensorFlow, PyTorch, and MXNet.
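To illustrate the reuse of intermediate gradients, here is a hand-written backward pass for a generic one-hidden-layer network, checked against a finite-difference estimate. The network, its sizes, and the names `W1`, `W2` are our own assumptions and do not correspond to the indexing used in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector
y = 1                             # index of the correct class
W1 = rng.normal(size=(4, 3))      # input -> hidden
W2 = rng.normal(size=(2, 4))      # hidden -> class scores

def loss(W1, W2):
    """Negative log-likelihood of class y under a softmax over the scores."""
    h = np.tanh(W1 @ x)
    s = W2 @ h
    return -(s[y] - np.log(np.sum(np.exp(s))))

# Forward pass, keeping the intermediate results (the nodes of the graph).
h = np.tanh(W1 @ x)
s = W2 @ h
p = np.exp(s) / np.exp(s).sum()

# Backward pass: each gradient reuses the one computed just before it.
ds = p.copy()
ds[y] -= 1.0                      # d loss / d s
dW2 = np.outer(ds, h)             # d loss / d W2   (reuses ds)
dh = W2.T @ ds                    # d loss / d h    (reuses ds)
da = dh * (1.0 - h ** 2)          # through tanh: tanh' = 1 - tanh^2
dW1 = np.outer(da, x)             # d loss / d W1   (reuses da)

# Finite-difference check on one entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
print(abs(numeric - dW1[2, 1]) < 1e-5)  # True
```

Automatic differentiation in the frameworks mentioned above performs the same bookkeeping for arbitrary computation graphs.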

5.2.3. Recurrent neural networks (RNN)

A feed-forward neural language model uses a fixed-length context. However, intuitively some words only require a short context to predict, e.g. function words like “of”, while others require a longer context, e.g. pronouns like “he” and “she”. So we would like to have a context of dynamic length and to capture long-range context beyond five words (which is the typical context length in n-gram models).

A recurrent neural network is a model that captures arbitrarily long context. The key idea is to update the hidden units recurrently given new inputs:

(5.2.4)\[h_t = \sigma(\underbrace{W_{hh}h_{t-1}}_{\text{previous state}}+ \underbrace{W_{ih}x_t}_{\text{new input}} + b_h) \;.\]

Note that the definition of \(h_t\) is recurrent, thus it incorporates information of all inputs up to time step \(t\).
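A minimal NumPy sketch of the recurrence (5.2.4) follows. We take \(\sigma = \tanh\) and use random weights and inputs; these choices, and the function name `rnn_forward`, are assumptions made for illustration.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_ih, b_h, h0):
    """Apply h_t = tanh(W_hh h_{t-1} + W_ih x_t + b_h) over a sequence."""
    h, states = h0, []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_ih @ x_t + b_h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 5, 6
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
W_ih = rng.normal(scale=0.5, size=(d_h, d_in))
b_h = np.zeros(d_h)
xs = rng.normal(size=(T, d_in))

states = rnn_forward(xs, W_hh, W_ih, b_h, np.zeros(d_h))

# Perturb only the first input and rerun: the last hidden state changes too,
# because h_T depends on every input up to step T.
xs_changed = xs.copy()
xs_changed[0] += 1.0
states_changed = rnn_forward(xs_changed, W_hh, W_ih, b_h, np.zeros(d_h))
print(np.abs(states[-1] - states_changed[-1]).max())
```

The same weights are applied at every time step, so the number of parameters does not grow with the length of the context.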


Fig. 5.2.2 Recurrent language model

Note that we can obtain the probability distribution of the next word using a softmax transformation of the output in Fig. 5.2.2:

(5.2.5)\[p(\cdot\mid x_1,\ldots,x_{t-1}) = \text{softmax}(o_t) \;.\]

Backpropagation through time

How do we do backpropagation on RNNs? If \(h_t\) is not recurrent, e.g. it only depends on the input \(x_t\), then the procedure is the same as for feed-forward neural networks. The fact that \(h_t\) now depends on \(h_{t-1}\), and that both depend on the same parameters, complicates the computation.

Let’s focus on the partial derivative \(\frac{\partial h_t}{\partial W_{hh}[ij]}\). (The rest can be computed easily, in the same way as for feed-forward neural networks.)

(5.2.6)\[\begin{split}\underbrace{\frac{\partial h_t}{\partial W_{hh}[ij]}}_{d_t} &= \frac{\partial}{\partial W_{hh}[ij]} \sigma( \underbrace{W_{hh}h_{t-1}+W_{ih}x_t}_{s}) \\ &= \frac{\partial \sigma}{\partial s} \frac{\partial s}{\partial W_{hh}[ij]} \\ &= \frac{\partial \sigma}{\partial s} \left ( h_{t-1} + W_{hh}\underbrace{\frac{\partial h_{t-1}}{\partial W_{hh}[ij]}}_{d_{t-1}} \right ) \;.\end{split}\]

Now that we have written the derivative \(d_t:=\frac{\partial h_t}{\partial W_{hh}[ij]}\) in a recurrent form, we can easily compute it.
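The recurrence for \(d_t\) is easiest to check in the scalar case. The sketch below assumes a one-dimensional hidden state with \(\sigma = \tanh\), so that \(W_{hh}\) becomes a scalar `w` and \(W_{ih}\) a scalar `u`. This simplification and all names are ours; (5.2.6) itself is stated for matrices.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Run h_t = tanh(w * h_{t-1} + u * x_t) and return the last state."""
    h = h0
    for x in xs:
        h = math.tanh(w * h + u * x)
    return h

def dh_dw(w, u, xs, h0=0.0):
    """Accumulate d_t = sigma'(s) * (h_{t-1} + w * d_{t-1}) forward in time."""
    h, d = h0, 0.0            # d_0 = 0: the initial state does not depend on w
    for x in xs:
        h_new = math.tanh(w * h + u * x)
        d = (1.0 - h_new ** 2) * (h + w * d)   # tanh'(s) = 1 - tanh(s)^2
        h = h_new
    return d

w, u = 0.8, 0.5
xs = [1.0, -0.5, 0.3, 0.9]
eps = 1e-6
numeric = (forward(w + eps, u, xs) - forward(w - eps, u, xs)) / (2 * eps)
analytic = dh_dw(w, u, xs)
print(abs(numeric - analytic) < 1e-6)  # True
```

Unrolling the loop shows that `d` picks up one extra factor of `w` (times a tanh derivative) per time step, which is the repeated multiplication responsible for vanishing and exploding gradients.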

However, there are several practical problems. If you expand the recurrent formula, you will see that it involves repeated multiplication by \(W_{hh}\). Why is this bad? First, it’s expensive (in both time and space). Second, with large powers of \(W_{hh}\), the gradient vanishes if its eigenvalues are less than 1 in magnitude and explodes if they are greater than 1 in magnitude. There are two quick fixes. First, truncate the backpropagation after \(k\) steps, i.e. assume that \(h_{t-k}\) does not depend on previous states. This is usually achieved by detach in deep learning frameworks. Second, clip the gradient to avoid exploding gradients.

Gated recurrent neural networks

Now, let’s try to fix the vanishing/exploding gradient problem from a modeling perspective. Note that in an RNN the information in previous states influences future states only through \(W_{hh}\). The gradient explodes when an input has a large impact on a distant output, i.e. there is a long-range dependency. Similarly, the gradient may vanish when an input is irrelevant. Thus it would be desirable to have some mechanism to decide when to “memorize” a state and when to “forget” it. This is the key idea in gated RNNs.

Here we describe one variant of RNN with gating, the long short-term memory (LSTM) architecture. Following our intuition, we would like additional memory to save useful information in the sequence. Let’s design a memory cell. It should update the memory with new information from the current time step when it’s important, and reset the memory when the information stored in it is no longer useful. Let’s use \(\tilde{c}_t\) to denote the new memory. Updating or resetting the memory can be controlled by two gates: an input gate \(i_t\) and a forget gate \(f_t\), both of which are vectors with the same dimension as the memory cell. We compute the memory cell \(c_t\) by

(5.2.7)\[c_t = \underbrace{i_t \odot \tilde{c}_t}_{\text{update with new memory}} + \underbrace{f_t \odot c_{t-1}}_{\text{reset old memory}} \;,\]

where \(\odot\) denotes elementwise multiplication. The new memory \(\tilde{c}_t\) incorporates information from \(x_t\) into the previous hidden state \(h_{t-1}\):

(5.2.8)\[\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c) \;.\]

We can think of \(i_t\) and \(f_t\) as deciding the proportion of information in \(\tilde{c}_t\) and \(c_{t-1}\) to incorporate and retain, respectively (along each dimension), thus their values should be between 0 and 1. Further, we make the decision based on past information in the sequence. Thus, we define

(5.2.9)\[\begin{split}i_t &= \text{sigmoid}(W_{xi}x_t + W_{hi}h_{t-1} + b_i) \;,\\ f_t &= \text{sigmoid}(W_{xf}x_t + W_{hf}h_{t-1} + b_f) \;.\end{split}\]

Finally, we can define the current hidden state \(h_t\) based on the memory cell \(c_t\):

(5.2.10)\[\begin{split}h_t &= o_t \odot c_t \;\text{, where} \\ o_t &= \text{sigmoid}(W_{xo}x_t + W_{ho}h_{t-1} + b_o) \;.\end{split}\]

Here \(o_t\) is the output gate controlling how much information to output (for prediction).
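One step of the cell defined by (5.2.7)-(5.2.10) can be sketched in NumPy as follows. The parameter layout (dictionaries keyed by gate name), the sizes, and the random weights are hypothetical choices for this illustration. The code follows the equations as written in this section, with \(h_t = o_t \odot c_t\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the gated memory cell in (5.2.7)-(5.2.10)."""
    W_x, W_h, b = params["W_x"], params["W_h"], params["b"]

    def affine(name):
        return W_x[name] @ x_t + W_h[name] @ h_prev + b[name]

    c_tilde = np.tanh(affine("c"))        # new memory        (5.2.8)
    i_t = sigmoid(affine("i"))            # input gate        (5.2.9)
    f_t = sigmoid(affine("f"))            # forget gate       (5.2.9)
    o_t = sigmoid(affine("o"))            # output gate       (5.2.10)
    c_t = i_t * c_tilde + f_t * c_prev    # memory cell       (5.2.7)
    h_t = o_t * c_t                       # hidden state      (5.2.10)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {
    "W_x": {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "cifo"},
    "W_h": {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "cifo"},
    "b": {g: np.zeros(d_h) for g in "cifo"},
}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.round(3), c.round(3))
```

Because the gates are elementwise, each dimension of the memory cell can be kept or overwritten independently of the others.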

Now, it may seem a bit redundant to use an additional memory cell. We should be able to directly apply the gating mechanism to the hidden states. Indeed, this is the key idea in another popular RNN variant, the gated recurrent unit (GRU). You can read more about GRUs in [D2L 9.1].

5.3. Evaluation

Like word embeddings, language modeling is not an application by itself. It’s usually used in downstream tasks like machine translation and speech recognition. So an extrinsic evaluation of language models would be to apply them in downstream tasks and measure the improvement over baseline language models. We mainly discuss intrinsic evaluation here.

From an ML perspective, our goal is to minimize the expected loss (NLL in this case). By minimizing the average loss on the training data, we hope that the model will also have small loss on unseen test data. So we can evaluate language models by their held-out likelihood, i.e. the likelihood of the test data under the language model, which can be easily calculated as

(5.3.1)\[\ell({D}) = \sum_{i=1}^{|D|} \log p_\theta(x_i\mid x_{1:i-1}) \;,\]

where \({D}\) is the held-out set represented as a long sequence of tokens, \(\theta\) denotes the parameters of the language model, and \(x_{1:i-1}\) denotes \(x_1, \ldots, x_{i-1}\).

To evaluate language models, we often use the information-theoretic quantity, perplexity, which has a nice physical meaning in this context. Perplexity can be computed from the held-out likelihood:

(5.3.2)\[\text{PPL}(D) = 2^{-\frac{\ell(D)}{|D|}} \;.\]

Here the logarithm in the log-likelihood must use the same base as the exponentiation in PPL (base 2 above). Note that the exponent is the average NLL on the held-out set. Lower perplexity corresponds to higher held-out likelihood and is thus desirable.
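As a small sketch of (5.3.1) and (5.3.2), the function below takes the probabilities a model assigned to each held-out token and returns the perplexity. The function name `perplexity` and the example probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model probabilities p(x_i | x_{1:i-1}) of held-out tokens."""
    log_likelihood = sum(math.log2(p) for p in token_probs)   # (5.3.1), base 2
    return 2 ** (-log_likelihood / len(token_probs))          # (5.3.2)

# A model that is uniform over a 1000-word vocabulary.
uniform = [1 / 1000] * 50
print(perplexity(uniform))    # approximately 1000

# A model that is more confident about the observed tokens.
confident = [0.5, 0.25, 0.5, 0.125]
print(perplexity(confident))
```

The uniform case gives a useful reference point: a model with perplexity 1000 is, on average, as uncertain as if it were choosing uniformly among 1000 words at each step.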

In information theory, perplexity measures how well a distribution predicts a sample, and is defined as \(2^{H(p)}\) where \(H\) is the entropy and \(p\) is the distribution of the random variable. Entropy can be considered as the expected number of bits needed to encode the value of the random variable. In our case, we do not know the true distribution of text and can only estimate it by a language model. So we cannot compute \(H(p)\); instead, what we are computing is the cross-entropy of \(p\) (the true distribution) and \(p_\theta\) (our estimate): \(-\mathbb{E}_{X\sim p}\log p_\theta(X)\approx -\frac{1}{|D|}\sum_{x_i\in D}\log p_\theta(x_i)\). A low-perplexity model can encode the next word using fewer bits, i.e. it is better at compressing the text.

5.4. Additional reading