If we want machines to understand and generate human language, we need a way to measure how likely different word sequences are. That’s the job of a language model—to assign probabilities to sequences of words. And when it comes to n-gram models, we do this using a technique called Maximum Likelihood Estimation (MLE).
In this blog post, we’ll walk through how MLE works for n-gram models, using step-by-step calculations, mini-corpus examples, and a case study from the Berkeley Restaurant Project.
1. What Does a Language Model Do Again?
A language model helps answer questions like:
What's the probability of the sentence: “I want sushi”?
Or:
Given “I want”, what is the most likely next word?
N-gram models answer this by using statistics gathered from real-world text. But where do these probabilities come from?
2. Enter Maximum Likelihood Estimation (MLE)
MLE is a simple idea:
Count how often something happens, and divide by how many opportunities it had to happen.
For a bigram model, we estimate the probability of a word wn given the previous word wn-1 like this:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Where:

- C(wn-1 wn) = count of the bigram (wn-1 wn)
- C(wn-1) = count of the preceding word wn-1
This is MLE: we use the relative frequency from our training corpus to estimate the probability.
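To make the idea concrete, here is a minimal sketch in Python (the function name `bigram_mle` and the example counts are purely illustrative):

```python
def bigram_mle(bigram_count, context_count):
    """Relative-frequency (MLE) estimate: C(wn-1 wn) / C(wn-1)."""
    return bigram_count / context_count

# Illustrative numbers: a bigram seen 3 times, its first word seen 10 times.
print(bigram_mle(3, 10))  # 0.3
```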
3. A Mini Example: The "I Am Sam" Corpus
Let’s use a toy dataset with three short sentences:

- I am Sam
- Sam I am
- I do not like green eggs and ham
We first preprocess the sentences by adding special tokens:

- <s> = sentence start
- </s> = sentence end
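As a quick sketch, here is one way to do that padding in Python; the helper name `pad` is just for illustration:

```python
sentences = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

def pad(sentence):
    """Wrap a whitespace-tokenized sentence with start/end markers."""
    return ["<s>"] + sentence.split() + ["</s>"]

corpus = [pad(s) for s in sentences]
print(corpus[0])  # ['<s>', 'I', 'am', 'Sam', '</s>']
```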
Step 1: Count the bigrams
We extract all adjacent word pairs (bigrams):
| Bigram | Count |
|---|---|
| <s> I | 2 |
| <s> Sam | 1 |
| I am | 2 |
| I do | 1 |
| am Sam | 1 |
| am </s> | 1 |
| Sam I | 1 |
| Sam </s> | 1 |
| do not | 1 |
| not like | 1 |
| like green | 1 |
| green eggs | 1 |
| eggs and | 1 |
| and ham | 1 |
| ham </s> | 1 |
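In code, this counting step is a one-liner with `collections.Counter`. A minimal sketch, reusing the padded toy corpus:

```python
from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
corpus = [["<s>"] + s.split() + ["</s>"] for s in sentences]

# Count every adjacent pair of tokens (bigram) across all sentences.
bigram_counts = Counter(
    (tokens[i], tokens[i + 1])
    for tokens in corpus
    for i in range(len(tokens) - 1)
)

print(bigram_counts[("<s>", "I")])   # 2
print(bigram_counts[("I", "am")])    # 2
print(bigram_counts[("am", "Sam")])  # 1
```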
Step 2: Count the unigrams (preceding words)
We also count how many times each word occurs as the first part of a bigram (only the words we need for the examples below are shown):
| Word | Count |
|---|---|
| <s> | 3 |
| I | 3 |
| am | 2 |
| Sam | 2 |
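A small sketch of this step: every token except the sentence-final </s> serves as a preceding word exactly once, so we can count contexts directly from the padded corpus.

```python
from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
corpus = [["<s>"] + s.split() + ["</s>"] for s in sentences]

# Every token except the closing </s> acts as a bigram context once.
context_counts = Counter(tok for tokens in corpus for tok in tokens[:-1])

print(context_counts["<s>"])  # 3
print(context_counts["I"])    # 3
print(context_counts["am"])   # 2
print(context_counts["Sam"])  # 2
```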
4. Calculate Bigram Probabilities
Using MLE:
Example 1: P(I | <s>) = C(<s> I) / C(<s>) = 2 / 3 ≈ 0.67
Example 2: P(Sam | am) = C(am Sam) / C(am) = 1 / 2 = 0.5
Example 3: P(am | I) = C(I am) / C(I) = 2 / 3 ≈ 0.67
So if your model has just seen “I am”, it expects “Sam” to follow about half the time in this toy corpus.
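Putting the two tables together, this short sketch reproduces those three estimates (the counts are copied from the tables above; `p_mle` is just an illustrative name):

```python
# Counts taken from the bigram and context tables above.
bigram_counts = {("<s>", "I"): 2, ("am", "Sam"): 1, ("I", "am"): 2}
context_counts = {"<s>": 3, "am": 2, "I": 3}

def p_mle(word, prev):
    """MLE bigram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts.get((prev, word), 0) / context_counts[prev]

print(f"P(I | <s>)  = {p_mle('I', '<s>'):.2f}")   # 0.67
print(f"P(Sam | am) = {p_mle('Sam', 'am'):.2f}")  # 0.50
print(f"P(am | I)   = {p_mle('am', 'I'):.2f}")    # 0.67
```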
5. Scaling Up: The Berkeley Restaurant Project
The Berkeley Restaurant Project (BeRP) was a real-world dialog system that answered restaurant queries like:
“Tell me about a good Chinese place nearby.”
The corpus contains thousands of user sentences. Here are a few rows of the bigram count table for a small vocabulary of 8 words (rows are the preceding word, columns the following word):

| | I | want | to | eat | Chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| I | 5 | 827 | 0 | 9 | 0 | 0 | 0 | 2 |
| want | 2 | 0 | 608 | 1 | 6 | 6 | 5 | 1 |
| to | 2 | 0 | 4 | 686 | 2 | 0 | 6 | 211 |
| eat | 0 | 0 | 2 | 0 | 16 | 2 | 42 | 0 |
These raw counts tell us, for example:

- “want to” occurred 608 times.
- “to eat” occurred 686 times.
- “eat lunch” occurred 42 times.
To estimate the probability of “lunch” given “eat”, we just divide by the total count of “eat” as a preceding word (746 in the full BeRP corpus):
Example: P(lunch | eat) = C(eat lunch) / C(eat) = 42 / 746 ≈ 0.056
This means: about 5.6% of the time, when people say “eat”, the next word is “lunch”.
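As a sketch, here is the same arithmetic in code. The row of counts comes from the BeRP table above, and the total count of “eat” (746) is the full-corpus figure mentioned above:

```python
# Bigram counts for the context word "eat" (row from the BeRP table above).
eat_row = {"i": 0, "want": 0, "to": 2, "eat": 0,
           "chinese": 16, "food": 2, "lunch": 42, "spend": 0}
eat_total = 746  # times "eat" appears as a preceding word in BeRP

p_lunch_given_eat = eat_row["lunch"] / eat_total
print(f"P(lunch | eat) = {p_lunch_given_eat:.3f}")  # ~0.056
```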
6. From Sentence to Probability
You can now estimate the probability of entire sequences.
Let’s say we want to compute the probability of:
“I want Chinese food”
Assuming <s> is the start token and </s> is the end, the bigram model multiplies one conditional probability per word:

P(<s> I want Chinese food </s>) = P(I | <s>) × P(want | I) × P(Chinese | want) × P(food | Chinese) × P(</s> | food)

From real BeRP data, we might get:

≈ 0.25 × 0.33 × 0.0065 × 0.52 × 0.68 ≈ 0.00019
The lower the final value, the less likely this full sequence is under that model. In practice, we’d use log-probabilities to avoid underflow in long sequences.
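Here is a minimal sketch of that computation, including the log-probability trick, using the BeRP estimates from the example above:

```python
import math

# Bigram probabilities for "<s> I want Chinese food </s>" (BeRP estimates).
bigram_probs = [0.25, 0.33, 0.0065, 0.52, 0.68]

# Direct product: fine for short sentences...
prob = math.prod(bigram_probs)
print(f"P(sentence) ≈ {prob:.6f}")  # ~0.000190

# ...but for long sequences we sum log-probabilities to avoid underflow.
log_prob = sum(math.log(p) for p in bigram_probs)
print(f"log P(sentence) ≈ {log_prob:.2f}")
print(f"exp(log P) ≈ {math.exp(log_prob):.6f}")  # recovers the same value
```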
7. Key Takeaways
- MLE gives us a fast, intuitive way to estimate n-gram probabilities from counts.
- Probabilities are computed by dividing bigram counts by unigram counts.
- Even simple models can capture realistic language patterns (e.g., “want to eat” vs. “eat food”).
- Models built from real data (like BeRP) give more useful predictions than toy corpora.
- Bigger corpora = better estimates, especially for rare combinations.