If we want machines to understand and generate human language, we need a way to measure how likely different word sequences are. That’s the job of a language model—to assign probabilities to sequences of words. And when it comes to n-gram models, we do this using a technique called Maximum Likelihood Estimation (MLE).
In this blog post, we’ll walk through how MLE works for n-gram models, using step-by-step calculations, mini-corpus examples, and a case study from the Berkeley Restaurant Project.
1. What Does a Language Model Do Again?
A language model helps answer questions like:
What's the probability of the sentence: “I want sushi”?
Or:
Given “I want”, what is the most likely next word?
N-gram models answer this by using statistics gathered from real-world text. But where do these probabilities come from?
2. Enter Maximum Likelihood Estimation (MLE)
MLE is a simple idea:
Count how often something happens, and divide by how many opportunities it had to happen.
For a bigram model, we estimate the probability of a word wn given the previous word wn-1 like this:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Where:

- C(wn-1 wn) = count of the bigram (wn-1 wn)
- C(wn-1) = count of the preceding word wn-1
This is MLE: we use the relative frequency from our training corpus to estimate the probability.
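To make the idea concrete, here is a minimal sketch in Python (the function name `bigram_mle` and the example counts are purely illustrative):

```python
def bigram_mle(bigram_count, context_count):
    """Relative-frequency (MLE) estimate: C(wn-1 wn) / C(wn-1)."""
    return bigram_count / context_count

# Illustrative numbers: a bigram seen 3 times, its first word seen 10 times.
print(bigram_mle(3, 10))  # 0.3
```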
3. A Mini Example: The "I Am Sam" Corpus
Let’s use a toy dataset with three short sentences:

- I am Sam
- Sam I am
- I do not like green eggs and ham
We first preprocess the sentences by adding special tokens:

- <s> = sentence start
- </s> = sentence end
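As a quick sketch, here is one way to do that padding in Python; the helper name `pad` is just for illustration:

```python
sentences = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

def pad(sentence):
    """Wrap a whitespace-tokenized sentence with start/end markers."""
    return ["<s>"] + sentence.split() + ["</s>"]

corpus = [pad(s) for s in sentences]
print(corpus[0])  # ['<s>', 'I', 'am', 'Sam', '</s>']
```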
Step 1: Count the bigrams
We extract all adjacent word pairs (bigrams):
| Bigram | Count |
|---|---|
| <s> I | 2 |
| <s> Sam | 1 |
| I am | 2 |
| I do | 1 |
| am Sam | 1 |
| am </s> | 1 |
| Sam I | 1 |
| Sam </s> | 1 |
| do not | 1 |
| not like | 1 |
| like green | 1 |
| green eggs | 1 |
| eggs and | 1 |
| and ham | 1 |
| ham </s> | 1 |
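In code, this counting step is a one-liner with `collections.Counter`. A minimal sketch, reusing the padded toy corpus:

```python
from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
corpus = [["<s>"] + s.split() + ["</s>"] for s in sentences]

# Count every adjacent pair of tokens (bigram) across all sentences.
bigram_counts = Counter(
    (tokens[i], tokens[i + 1])
    for tokens in corpus
    for i in range(len(tokens) - 1)
)

print(bigram_counts[("<s>", "I")])   # 2
print(bigram_counts[("I", "am")])    # 2
print(bigram_counts[("am", "Sam")])  # 1
```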
Step 2: Count the unigrams (preceding words)
We also count how many times each word occurs as the first part of a bigram (only the words we need for the examples below are shown):
| Word | Count |
|---|---|
| <s> | 3 |
| I | 3 |
| am | 2 |
| Sam | 2 |
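A small sketch of this step: every token except the sentence-final </s> serves as a preceding word exactly once, so we can count contexts directly from the padded corpus.

```python
from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
corpus = [["<s>"] + s.split() + ["</s>"] for s in sentences]

# Every token except the closing </s> acts as a bigram context once.
context_counts = Counter(tok for tokens in corpus for tok in tokens[:-1])

print(context_counts["<s>"])  # 3
print(context_counts["I"])    # 3
print(context_counts["am"])   # 2
print(context_counts["Sam"])  # 2
```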
4. Calculate Bigram Probabilities
Using MLE:
Example 1: P(I | <s>) = C(<s> I) / C(<s>) = 2 / 3 ≈ 0.67
Example 2: P(Sam | am) = C(am Sam) / C(am) = 1 / 2 = 0.5
Example 3: P(am | I) = C(I am) / C(I) = 2 / 3 ≈ 0.67
So if your model has just seen “I am”, it expects “Sam” to follow about half the time in this toy corpus.
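Putting the two tables together, this short sketch reproduces those three estimates (the counts are copied from the tables above; `p_mle` is just an illustrative name):

```python
# Counts taken from the bigram and context tables above.
bigram_counts = {("<s>", "I"): 2, ("am", "Sam"): 1, ("I", "am"): 2}
context_counts = {"<s>": 3, "am": 2, "I": 3}

def p_mle(word, prev):
    """MLE bigram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts.get((prev, word), 0) / context_counts[prev]

print(f"P(I | <s>)  = {p_mle('I', '<s>'):.2f}")   # 0.67
print(f"P(Sam | am) = {p_mle('Sam', 'am'):.2f}")  # 0.50
print(f"P(am | I)   = {p_mle('am', 'I'):.2f}")    # 0.67
```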
5. Scaling Up: The Berkeley Restaurant Project
The Berkeley Restaurant Project (BeRP) was a real-world dialog system that answered restaurant queries like:
“Tell me about a good Chinese place nearby.”
The corpus contains thousands of user sentences. Here are a few rows of the bigram count table for a small vocabulary of 8 words (rows are the preceding word, columns the following word):

| | I | want | to | eat | Chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| I | 5 | 827 | 0 | 9 | 0 | 0 | 0 | 2 |
| want | 2 | 0 | 608 | 1 | 6 | 6 | 5 | 1 |
| to | 2 | 0 | 4 | 686 | 2 | 0 | 6 | 211 |
| eat | 0 | 0 | 2 | 0 | 16 | 2 | 42 | 0 |
These raw counts tell us, for example:

- “want to” occurred 608 times.
- “to eat” occurred 686 times.
- “eat lunch” occurred 42 times.
To estimate the probability of “lunch” given “eat”, we just divide by the total count of “eat” as a preceding word (746 in the full BeRP corpus):
Example: P(lunch | eat) = C(eat lunch) / C(eat) = 42 / 746 ≈ 0.056
This means: about 5.6% of the time, when people say “eat”, the next word is “lunch”.
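As a sketch, here is the same arithmetic in code. The row of counts comes from the BeRP table above, and the total count of “eat” (746) is the full-corpus figure mentioned above:

```python
# Bigram counts for the context word "eat" (row from the BeRP table above).
eat_row = {"i": 0, "want": 0, "to": 2, "eat": 0,
           "chinese": 16, "food": 2, "lunch": 42, "spend": 0}
eat_total = 746  # times "eat" appears as a preceding word in BeRP

p_lunch_given_eat = eat_row["lunch"] / eat_total
print(f"P(lunch | eat) = {p_lunch_given_eat:.3f}")  # ~0.056
```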
6. From Sentence to Probability
You can now estimate the probability of entire sequences.
Let’s say we want to compute the probability of:
“I want Chinese food”
Assuming <s> is the start token and </s> is the end, the bigram model multiplies one conditional probability per word:

P(<s> I want Chinese food </s>) = P(I | <s>) × P(want | I) × P(Chinese | want) × P(food | Chinese) × P(</s> | food)

From real BeRP data, we might get:

≈ 0.25 × 0.33 × 0.0065 × 0.52 × 0.68 ≈ 0.00019
The lower the final value, the less likely this full sequence is under that model. In practice, we’d use log-probabilities to avoid underflow in long sequences.
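Here is a minimal sketch of that computation, including the log-probability trick, using the BeRP estimates from the example above:

```python
import math

# Bigram probabilities for "<s> I want Chinese food </s>" (BeRP estimates).
bigram_probs = [0.25, 0.33, 0.0065, 0.52, 0.68]

# Direct product: fine for short sentences...
prob = math.prod(bigram_probs)
print(f"P(sentence) ≈ {prob:.6f}")  # ~0.000190

# ...but for long sequences we sum log-probabilities to avoid underflow.
log_prob = sum(math.log(p) for p in bigram_probs)
print(f"log P(sentence) ≈ {log_prob:.2f}")
print(f"exp(log P) ≈ {math.exp(log_prob):.6f}")  # recovers the same value
```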
7. Key Takeaways
- MLE gives us a fast, intuitive way to estimate n-gram probabilities from counts.
- Probabilities are computed by dividing bigram counts by unigram counts.
- Even simple models can capture realistic language patterns (e.g., “want to eat” vs. “eat food”).
- Models built from real data (like BeRP) give more useful predictions than toy corpora.
- Bigger corpora = better estimates, especially for rare combinations.