N-gram language models — Part 1

MTI Technology AI Lab
6 min read · Jul 28, 2021

Contents

1. Background

2. Data

3. Unigram language model

3.1 What is a unigram?

3.2 Training the model

3.3 Evaluating the model

3.4 The role of ending symbols

3.5 Evaluation metric: average log-likelihood

Background

Language modeling — that is, predicting the probability of a word in a sentence — is a fundamental task in natural language processing. It is used in many NLP applications such as autocomplete, spelling correction, and text generation.


Currently, language models based on neural networks, especially transformers, are the state of the art: they predict a word in a sentence very accurately based on the surrounding words. However, in this project, I will revisit the most classic of language models: the n-gram model.

  • In part 1 of the project, I will introduce the unigram model i.e. a model based on single words. Then, I will demonstrate the challenge of predicting the probability of “unknown” words — words the model hasn’t been trained on — and the Laplace smoothing method that alleviates this problem. I will then show that this smoothing method can be viewed as model interpolation, i.e. the combination of different models.
  • In part 2 of the project, I will introduce higher n-gram models and show the effect of interpolating these n-gram models.
  • In part 3 of the project, aside from interpolating the n-gram models with equal proportions, I will use the expectation-maximization algorithm to find the best weights for combining these models.

Data

In this project, my training data set — appropriately called train — is “A Game of Thrones”, the first book in the George R. R. Martin fantasy series that inspired the popular TV show of the same name.

I will then use two texts to evaluate our language model:

  • The first text, called dev1, is the book “A Clash of Kings”, the second book in the same series as the training text. The word dev stands for development, as we will later use these texts to fine-tune our language model.
  • The second text, called dev2, is the book “Gone with the Wind”, a 1936 historical fiction novel set in the American South, written by Margaret Mitchell. This book was selected to show how our language model fares when evaluated on a text that is very different from its training text.

Unigram language model

What is a unigram?

In natural language processing, an n-gram is a sequence of n words. For example, “statistics” is a unigram (n = 1), “machine learning” is a bigram (n = 2), “natural language processing” is a trigram (n = 3), and so on. For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part of the project, we will focus only on language models based on unigrams i.e. single words.
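To make the definition concrete, here is a minimal Python sketch of how n-grams can be read off a tokenized sentence. The function name extract_ngrams and the sample sentence are illustrative only, not part of the project code.

def extract_ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window into one n-gram.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(extract_ngrams(tokens, 1))  # unigrams: ['natural', 'language', 'processing', 'is', 'fun']
print(extract_ngrams(tokens, 2))  # bigrams: ['natural language', 'language processing', ...]
print(extract_ngrams(tokens, 3))  # trigrams: ['natural language processing', ...]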

Training the model

A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence “I have a dream”, our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence:
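Written out with the chain rule, and with [END] treated as the final token (explained below), the goal is to estimate

P(\text{i have a dream [END]}) = P(\text{i}) \, P(\text{have} \mid \text{i}) \, P(\text{a} \mid \text{i have}) \, P(\text{dream} \mid \text{i have a}) \, P(\text{[END]} \mid \text{i have a dream})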


For simplicity, all words are lower-cased in the language model, and punctuation is ignored. The [END] token marks the end of a sentence and will be explained shortly.
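A minimal sketch of this preprocessing, assuming whitespace tokenization and the literal string "[END]" as the ending token (both simplifying assumptions for illustration):

import string

def preprocess(sentence):
    # Lower-case the sentence, drop punctuation, split on whitespace, and append the [END] token.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split() + ["[END]"]

print(preprocess("I have a dream."))  # ['i', 'have', 'a', 'dream', '[END]']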

The unigram language model makes the following assumptions:

  1. The probability of each word is independent of any words before it.
  2. Instead, it only depends on the fraction of times this word appears among all the words in the training text. In other words, training the model is nothing but calculating these fractions for all unigrams in the training text (see the short sketch below).
Figure: Estimated probability of the unigram ‘dream’ from the training text
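In symbols, the estimate for a unigram such as ‘dream’ is

P(\text{dream}) = \frac{\text{count}(\text{dream})}{\text{total number of words in the training text}}

A minimal training sketch along these lines, assuming the training text has already been split into preprocessed token lists and that [END] tokens are counted like ordinary words (the names are illustrative, not the project's actual code):

from collections import Counter

def train_unigram(sentences):
    # sentences: list of token lists, each ending with '[END]'.
    counts = Counter(token for sentence in sentences for token in sentence)
    total = sum(counts.values())
    # The probability of each unigram is its count divided by the total number of tokens.
    return {token: count / total for token, count in counts.items()}

train_sentences = [
    ["i", "have", "a", "dream", "[END]"],
    ["a", "dream", "is", "a", "wish", "[END]"],
]
probs = train_unigram(train_sentences)
print(probs["dream"])  # 2 / 11 ≈ 0.18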

Evaluating the model

After estimating all unigram probabilities, we can apply these estimates to calculate the probability of each sentence in the evaluation text: each sentence probability is the product of word probabilities.
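Under the unigram assumption, for a sentence made up of the words w_1, w_2, …, w_k this product is

P(w_1 \, w_2 \dots w_k \, \text{[END]}) = P(w_1) \, P(w_2) \cdots P(w_k) \, P(\text{[END]})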


We can go further than this and estimate the probability of the entire evaluation text, such as dev1 or dev2. Under the naive assumption that each sentence in the text is independent of other sentences, we can decompose this probability as the product of the sentence probabilities, which in turn are nothing but products of word probabilities.
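For an evaluation text with S sentences, where w_{s,i} denotes the i-th word (including [END]) of sentence s, this decomposition is

P(\text{text}) = \prod_{s=1}^{S} P(\text{sentence}_s) = \prod_{s=1}^{S} \prod_{i} P(w_{s,i})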


The role of ending symbols

As outlined above, our language model not only assigns probabilities to words, but also to all sentences in a text. As a result, to ensure that the probabilities of all possible sentences sum to 1, we need to add the symbol [END] to the end of each sentence and estimate its probability as if it were a real word. This is a rather esoteric detail, and you can read more about its rationale here (page 4).

Evaluation metric: average log-likelihood

When we take the log of both sides of the above equation for the probability of the evaluation text, the log probability of the text (also called the log-likelihood) becomes the sum of the log probabilities of each word. Lastly, we divide this log-likelihood by the number of words in the evaluation text so that our metric does not depend on the length of the text.
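With N words in the evaluation text, and using base-2 logs as noted below, the metric is

\text{average log-likelihood} = \frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)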


For n-gram models, the base-2 logarithm is often used because of its link to information theory (see here, page 21).

As a result, we end up with the metric of average log-likelihood, which is simply the average of the trained log probabilities of each word in our evaluation text. In other words, the better our language model is, the higher the probability it assigns, on average, to each word in the evaluation text.
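A minimal evaluation sketch under these definitions, reusing the probs dictionary from the training sketch above. For simplicity it assumes every word in the evaluation text was seen during training; unseen words are exactly the problem that smoothing, discussed later, addresses.

import math

def average_log_likelihood(sentences, probs):
    # sentences: list of token lists from the evaluation text, each ending with '[END]'.
    log_likelihood = 0.0
    n_words = 0
    for sentence in sentences:
        for token in sentence:
            log_likelihood += math.log2(probs[token])  # raises KeyError for unseen words
            n_words += 1
    return log_likelihood / n_words

eval_sentences = [["i", "have", "a", "wish", "[END]"]]
print(average_log_likelihood(eval_sentences, probs))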

Other common evaluation metrics for language models include cross-entropy and perplexity. However, they measure essentially the same thing: cross-entropy is the negative of the average log-likelihood, while perplexity is the cross-entropy exponentiated (2 raised to the cross-entropy when base-2 logs are used).
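With base-2 logs, the relationships are

\text{cross-entropy} = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i) = -\text{average log-likelihood}, \qquad \text{perplexity} = 2^{\,\text{cross-entropy}}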
