Let’s say you’ve built a language model, trained it on thousands of sentences, and you're ready to test it. But then… disaster:
Your model assigns a probability of zero to a perfectly valid sentence.
Why?
Because your model has never seen that exact word combination before. In the world of n-gram models, this is known as the zero-probability problem, and it can cripple your model’s ability to generalize.
The solution? Smoothing.
In this post, we’ll explore why smoothing is essential, and break down four classic smoothing techniques:
✔️ Add-one (Laplace)
✔️ Add-k
✔️ Interpolation
✔️ Stupid backoff
We'll also walk through formulas and give you intuitive ways to visualize what each method is doing.
1. The Zero-Probability Problem
Imagine you train a bigram model on a tiny corpus like this:

“I like pizza”
“I want tacos”

Then someone types:

“I like tacos”
Uh-oh.
The model has:

- Seen “I like”
- Seen “like pizza”
- Seen “want tacos”

But it has never seen “like tacos”, so it assigns:

$$P(\text{tacos} \mid \text{like}) = 0$$

Which means:

- The sentence “I like tacos” gets zero total probability
- Perplexity becomes undefined (division by zero)
- The model fails to recognize a totally valid combination
This is where smoothing comes in—to assign a small but nonzero probability to unseen events.
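To see the failure concretely, here is a minimal sketch (the two-sentence toy corpus and the `bigram_mle` helper are illustrative, not from any real dataset) of an unsmoothed maximum-likelihood bigram estimate collapsing to zero:

```python
from collections import Counter

# Toy training corpus consistent with the bigrams discussed above.
corpus = [["i", "like", "pizza"], ["i", "want", "tacos"]]

# Count bigrams and their left-hand contexts.
bigrams, contexts = Counter(), Counter()
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1

def bigram_mle(prev, cur):
    """Unsmoothed maximum-likelihood estimate of P(cur | prev)."""
    return bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0

print(bigram_mle("like", "pizza"))  # 1.0 -- seen in training
print(bigram_mle("like", "tacos"))  # 0.0 -- the zero-probability problem
```

Multiply that 0.0 into the sentence probability and the whole product is zero, which is exactly why perplexity blows up.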
2. Add-One Smoothing (Laplace)
The most basic smoothing technique is add-one smoothing, also known as Laplace smoothing.
Formula:

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$$

Where:

- $C(w_{i-1}, w_i)$: count of the bigram
- $C(w_{i-1})$: count of the context word
- $V$: vocabulary size

This method:

- Adds 1 to every possible bigram count
- Adds V to the denominator to re-normalize
Visualization:
Imagine a histogram of bigram counts. Add-one smoothing raises the floor for every possible next word, so no bar sits at zero. Now “like tacos” has a probability > 0.
But... it's not perfect:

- Penalizes frequent events too much
- Can distort distributions, especially with large vocabularies
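As a rough sketch of what the formula does in code, here is add-one smoothing applied to the toy counters from the earlier snippet (the vocabulary is just the five word types in that corpus, so the numbers are only illustrative):

```python
def laplace_bigram(prev, cur, bigrams, contexts, vocab_size):
    """Add-one (Laplace) estimate: (C(prev, cur) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)

vocab = {w for sentence in corpus for w in sentence}
V = len(vocab)  # 5 word types: i, like, pizza, want, tacos

print(laplace_bigram("like", "tacos", bigrams, contexts, V))  # 1 / 6 ≈ 0.167
print(laplace_bigram("like", "pizza", bigrams, contexts, V))  # 2 / 6 ≈ 0.333
```

Note how the seen bigram “like pizza” drops from 1.0 to about 0.33: with a full vocabulary in the denominator, add-one takes a lot of mass away from observed events, which is exactly the over-penalization mentioned above.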
3. Add-k Smoothing
A generalization of add-one is add-k smoothing, where we add a small fraction k (instead of 1) to each count.
Formula:

$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + kV}$$

- Common choices: small fractional values of k, such as the k = 0.1 used in the example table below
- Keeps rare events plausible without overwhelming frequent ones
Visualization:

- Like add-one, but with a gentler lift to unseen events
- Prevents over-discounting common n-grams
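The only change from the add-one sketch is the constant, so the code barely changes (again reusing the toy `bigrams`, `contexts`, and `V` from above; the choice of k = 0.1 simply mirrors the example table later in the post):

```python
def add_k_bigram(prev, cur, bigrams, contexts, vocab_size, k=0.1):
    """Add-k estimate: (C(prev, cur) + k) / (C(prev) + k * V)."""
    return (bigrams[(prev, cur)] + k) / (contexts[prev] + k * vocab_size)

print(add_k_bigram("like", "tacos", bigrams, contexts, V))  # 0.1 / 1.5 ≈ 0.067
print(add_k_bigram("like", "pizza", bigrams, contexts, V))  # 1.1 / 1.5 ≈ 0.733
```

The seen bigram keeps most of its probability mass, while the unseen one still gets a small, nonzero share.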
4. Interpolation: Combining N-Gram Orders
Instead of relying only on high-order n-grams (which are more sparse), interpolation combines estimates from multiple n-gram levels.
Formula (trigram example):

$$\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$$

Where:

- $\lambda_1, \lambda_2, \lambda_3$ are weights that sum to 1
- Each term comes from a different n-gram level (tri-, bi-, unigram)

Intuition:

- If a trigram is unseen, fall back partially to bigram and unigram
- No hard cutoff, just a weighted blend
Visual Analogy:

Think of a 3-layer fallback system: the trigram layer answers when it has evidence, the bigram layer catches what it misses, and the unigram layer is the safety net at the bottom.

Advantages:

- Smooth transition between specificity and generality
- More robust than add-one for real-world LMs
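A minimal sketch of linear interpolation, assuming you already have trigram, bigram, and unigram estimates stored in dictionaries (the λ values here are arbitrary placeholders; in practice you would tune them on held-out data):

```python
def interpolated_trigram(w1, w2, w3, trigram_p, bigram_p, unigram_p,
                         lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    The weights must sum to 1; missing n-grams simply contribute 0 to the blend.
    """
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((w1, w2, w3), 0.0)
            + l2 * bigram_p.get((w2, w3), 0.0)
            + l3 * unigram_p.get(w3, 0.0))

# Even with zero trigram and bigram evidence, the unigram term keeps the
# estimate above zero.
unigram_p = {"tacos": 0.125}  # toy unigram estimate
print(interpolated_trigram("i", "like", "tacos", {}, {}, unigram_p))  # 0.1 * 0.125 = 0.0125
```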
5. Stupid Backoff: A Fast and Practical Heuristic
Proposed by Brants et al. (2007), stupid backoff is an efficient alternative for large-scale language models.
Instead of computing weighted averages, it just backs off to lower-order n-grams with a fixed discount.
Formula:

$$S(w_i \mid w_{i-n+1}^{i-1}) = \begin{cases} \dfrac{\text{count}(w_{i-n+1}^{i})}{\text{count}(w_{i-n+1}^{i-1})} & \text{if count}(w_{i-n+1}^{i}) > 0 \\[6pt] \alpha \cdot S(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise} \end{cases}$$

Where:

- $w_{i-n+1}^{i-1}$ is the current context (e.g., trigram history)
- $w_{i-n+2}^{i-1}$ is the shorter context (e.g., bigram history)
- $\alpha$ is a discount factor (usually 0.4)
Behavior:

- No normalization needed
- Not a true probability distribution
- Very fast and scales to web-sized corpora

Visual:

Imagine falling down a ladder: if the trigram rung gives you nothing, you drop to the bigram rung, then to the unigram rung, paying a discount of 0.4 at each step.
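Here is a minimal recursive sketch of stupid backoff (the tuple-keyed `counts` layout and the toy numbers are my own convention for illustration, not from Brants et al.):

```python
def stupid_backoff(words, counts, alpha=0.4):
    """Stupid backoff score S(w_n | w_1..w_{n-1}) for an n-gram given as a tuple.

    `counts` maps n-gram tuples (including 1-grams) to counts, and counts[()]
    holds the total number of tokens. The result is a score, not a normalized
    probability.
    """
    if len(words) == 1:                      # base case: relative unigram frequency
        return counts.get(words, 0) / counts[()]
    if counts.get(words, 0) > 0:             # n-gram observed: use its relative frequency
        return counts[words] / counts[words[:-1]]
    return alpha * stupid_backoff(words[1:], counts, alpha)  # back off with discount

# Toy counts: "like tacos" is unseen, so we back off to the unigram "tacos".
counts = {(): 8, ("tacos",): 1, ("like",): 1, ("like", "pizza"): 1}
print(stupid_backoff(("like", "tacos"), counts))  # 0.4 * (1/8) = 0.05
```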
6. Smoothing in Action: Example Table
Let’s say we want to compute $P(\text{tacos} \mid \text{like})$ under each method:
| Method | Raw Count (like tacos) | Probability |
|---|---|---|
| MLE (no smoothing) | 0 | 0 |
| Add-one | 0 | ~0.0003 |
| Add-k (k=0.1) | 0 | ~0.00004 |
| Interpolation | 0 (but unigram tacos > 0) | ~0.0002 |
| Stupid backoff | 0 | 0.4 × P(tacos) |
Even if “like tacos” was never seen in training, we can still assign a small, meaningful probability.
7. Summary: Smoothing Techniques Compared
| Technique | Strengths | Weaknesses |
|---|---|---|
| Add-One (Laplace) | Simple to implement | Over-penalizes frequent words |
| Add-k | More balanced than add-one | Still heuristic; may need tuning |
| Interpolation | Blends context levels smartly | Requires a held-out dev set to tune the λs |
| Stupid Backoff | Extremely fast and scalable | Not a true probability; no normalization |
Conclusion
Smoothing isn’t just a technical detail—it’s what makes language modeling practical. Without it, your model can’t handle new combinations, rare phrases, or creative variations.
Whether you’re building an academic prototype or a production-scale LM, choosing the right smoothing method makes the difference between a brittle parrot and a flexible, adaptive model.