Saturday, 15 November 2025

How Do We Estimate Probabilities in N-Gram Models?

 

If we want machines to understand and generate human language, we need a way to measure how likely different word sequences are. That’s the job of a language model—to assign probabilities to sequences of words. And when it comes to n-gram models, we do this using a technique called Maximum Likelihood Estimation (MLE).

In this blog post, we’ll walk through how MLE works for n-gram models, using step-by-step calculations, mini-corpus examples, and a case study from the Berkeley Restaurant Project.

1. What Does a Language Model Do Again?

A language model helps answer questions like:

What's the probability of the sentence: “I want sushi”?

Or:

Given “I want”, what is the most likely next word?

N-gram models answer this by using statistics gathered from real-world text. But where do these probabilities come from?

2. Enter Maximum Likelihood Estimation (MLE)

MLE is a simple idea:

Estimate a probability by counting how often something happens and dividing by the number of opportunities it had to happen.

For a bigram model:

  • Estimate the probability of a word w_n given the previous word w_{n-1} like this:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} \; w_n)}{C(w_{n-1})}

Where:

  • C(w_{n-1} \; w_n) = Count of the bigram (w_{n-1} w_n)

  • C(w_{n-1}) = Count of the preceding word w_{n-1}

This is MLE: we use the relative frequency from our training corpus to estimate the probability.
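In code, this estimator is nothing more than a lookup and a division. Here is a minimal Python sketch (the dictionary arguments and the function name are illustrative, not from any particular library):

def bigram_mle(bigram_counts, unigram_counts, prev_word, word):
    # P(word | prev_word) = C(prev_word word) / C(prev_word)
    context_count = unigram_counts.get(prev_word, 0)
    if context_count == 0:
        return 0.0  # unseen context: the MLE estimate is undefined, so fall back to 0
    return bigram_counts.get((prev_word, word), 0) / context_count

In the next sections we'll fill those two count tables from an actual corpus.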

3. A Mini Example: The "I Am Sam" Corpus

Let’s use a toy dataset with three short sentences:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

We first preprocess the sentences by adding special tokens:

  • <s> = sentence start

  • </s> = sentence end

Step 1: Count the bigrams

We extract all adjacent word pairs (bigrams):

Bigram        Count
<s> I         2
<s> Sam       1
I am          2
I do          1
am Sam        1
am </s>       1
Sam </s>      1
Sam I         1
do not        1
not like      1
like green    1
green eggs    1
eggs and      1
and ham       1

Step 2: Count the unigrams (preceding words)

We also count how many times each preceding word occurs as the first part of a bigram (only the contexts we need below are shown).

Word    Count
<s>     3
I       3
am      2
Sam     2
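Both tables can be reproduced with a few lines of Python; here is a sketch using collections.Counter and a simple whitespace tokenizer:

from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for prev_word, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev_word, word)] += 1
        unigram_counts[prev_word] += 1  # counts each word's appearances as a bigram context

print(bigram_counts[("I", "am")])    # 2
print(bigram_counts[("am", "Sam")])  # 1
print(unigram_counts["<s>"])         # 3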

4. Calculate Bigram Probabilities

Using MLE:

Example 1: P(I | <s>)

= \frac{C(<s> \; I)}{C(<s>)} = \frac{2}{3} \approx 0.67

Example 2: P(Sam | am)

= \frac{C(am \; Sam)}{C(am)} = \frac{1}{2} = 0.5

Example 3: P(am | I)

= \frac{C(I \; am)}{C(I)} = \frac{2}{3} \approx 0.67

So if your model sees “I am”, half the time “Sam” follows, and the other half the sentence simply ends.
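Continuing the sketch from Step 2, the same three probabilities fall straight out of the two counters:

def p(word, prev_word):
    # MLE: relative frequency of the bigram within its context
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(round(p("I", "<s>"), 2))   # 0.67
print(round(p("Sam", "am"), 2))  # 0.5
print(round(p("am", "I"), 2))    # 0.67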

5. Scaling Up: The Berkeley Restaurant Project

The Berkeley Restaurant Project (BeRP) was a real-world dialog system that answered restaurant queries like:

“Tell me about a good Chinese place nearby.”

The corpus contains thousands of user sentences. Let's look at a simplified table of bigram counts over a small 8-word vocabulary (rows are the first word of the bigram, columns are the second):

        I       want    to      eat     Chinese food    lunch   spend
I       5       827     0       9       0       0       0       2
want    2       0       608     1       6       6       5       1
to      2       0       4       686     2       0       6       211

These raw counts tell us, for example:

  • “want to” occurred 608 times.

  • “to eat” occurred 686 times.

  • “eat food” occurred 42 times (from a row of the full table not shown here).

To estimate the probability of “food” given “eat”, we just divide by the unigram count of “eat” (746 in the full corpus):

Example: P(food | eat)

= \frac{C(eat \; food)}{C(eat)} = \frac{42}{746} \approx 0.056

This means: about 5.6% of the time, when people say “eat”, the next word is “food”.

6. From Sentence to Probability

You can now estimate the probability of entire sequences.

Let’s say we want to compute the probability of:

“I want Chinese food”

Assuming <s> is the start token and </s> is the end:

P(<s> \; I \; want \; Chinese \; food \; </s>) = P(I \mid <s>) \times P(want \mid I) \times P(Chinese \mid want) \times P(food \mid Chinese) \times P(</s> \mid food)

From real BeRP data, we might get:

= 0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \approx 0.000031

The lower the final value, the less likely this full sequence is under that model. In practice, we’d use log-probabilities to avoid underflow in long sequences.
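Here is a quick sketch of that computation, plugging in the five bigram probabilities quoted above and showing the log-space version that real implementations prefer:

import math

# The five bigram probabilities quoted above (illustrative BeRP estimates)
bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# Direct product of the probabilities
p_sentence = math.prod(bigram_probs)
print(p_sentence)  # ≈ 3.1e-05

# Log-space: add log-probabilities instead of multiplying tiny numbers
log_p = sum(math.log(p) for p in bigram_probs)
print(log_p)            # ≈ -10.4
print(math.exp(log_p))  # same ≈ 3.1e-05, recovered from log space

Summing logs keeps the running value in a comfortable numeric range even for sentences with dozens of words.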

7. Key Takeaways

  • MLE gives us a fast, intuitive way to estimate n-gram probabilities from counts.

  • Probabilities are computed by dividing bigram counts by unigram counts.

  • Even simple models can capture realistic language patterns (e.g., “want to eat” vs. “eat food”).

  • Models built from real data (like BeRP) give more useful predictions than toy corpora.

  • Bigger corpora = better estimates, especially for rare combinations.
