When we build a language model—whether it's a simple n-gram model or a massive transformer like GPT—we need to ask a fundamental question:
How good is this model at predicting language?
To answer that, we need a reliable evaluation metric—something that tells us how well our model understands the structure and flow of words.
That’s where perplexity comes in.
In this blog post, we’ll demystify the concept of perplexity, explain how it relates to probability and prediction, and show how it helps compare different language models (like unigrams, bigrams, and trigrams).
1. What is Perplexity?
Perplexity is the most widely used intrinsic evaluation metric for language models.
At its core, perplexity measures how surprised a language model is when it sees the actual words in a test set.
A lower perplexity means the model is less surprised — and therefore better at prediction.
Formally, for a test sequence of N words $W = w_1 w_2 \dots w_N$, the perplexity is:

$$\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$

By the chain rule, this can also be written as a product of inverse per-word conditional probabilities:

$$\mathrm{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$$
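To make the formula concrete, here is a minimal Python sketch (the function and the toy probabilities are our own illustration, not from any particular library) that turns a list of per-word conditional probabilities into a perplexity score:

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence, given each word's conditional
    probability P(w_i | history) under the model."""
    n = len(word_probs)
    sequence_prob = math.prod(word_probs)   # P(w_1 ... w_N) via the chain rule
    return sequence_prob ** (-1.0 / n)      # N-th root of the inverse probability

# Toy example: made-up probabilities for a 4-word test sentence.
print(perplexity([0.2, 0.1, 0.25, 0.05]))  # ~7.95
print(perplexity([1.0, 1.0, 1.0, 1.0]))    # 1.0: a "perfect", never-surprised model
```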
2. Perplexity and Probability: The Inverse Relationship
Let’s break down that formula.
If your model assigns a high probability to the test sentence, perplexity will be low.
If your model assigns a low probability, perplexity will be high.
Example:
- A perfect model (probability = 1): perplexity = 1 (no surprise)
- A poor model (probability = 0.000001): perplexity = very high (very surprised)
In short:
| Probability of Test Set | Perplexity |
|---|---|
| High | Low |
| Medium | Medium |
| Low | High |
Thus, minimizing perplexity is equivalent to maximizing the test set probability.
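To make the relationship concrete (with made-up numbers): suppose a 5-word test sentence receives total probability $10^{-10}$ under model A but $10^{-5}$ under model B. Model A's perplexity is $(10^{-10})^{-1/5} = 100$, while model B's is $(10^{-5})^{-1/5} = 10$, so the model that assigned the higher probability is ten times less "perplexed" per word.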
3. A Real-World Analogy
Think of perplexity as the average number of options your model is weighing at each step (its effective branching factor).
- A model with perplexity 10 is like saying: “At each word, I’m choosing from 10 equally likely candidates.”
- A model with perplexity 100 is saying: “I have no idea—maybe it’s one of these 100 words.”
The smaller the set of “likely” options, the better your model knows the language.
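This "number of options" intuition can be checked directly: a model that spreads its probability uniformly over a vocabulary of V words has perplexity exactly V. A quick sketch (the vocabulary size and sequence length are arbitrary):

```python
V = 1000                  # assumed vocabulary size, chosen for illustration
uniform_prob = 1.0 / V    # every word is equally likely at every step
n_words = 50              # length of the test sequence (any length gives the same result)

pp = (uniform_prob ** n_words) ** (-1.0 / n_words)
print(pp)                 # 1000.0: perplexity equals the vocabulary size
```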
4. Perplexity by N-Gram Model: A Comparison Table
Let’s look at how perplexity improves with richer context.
A simple experiment was run on a 1.5 million word Wall Street Journal test set using different n-gram models trained on a large corpus.
| Model Type | Context Used | Perplexity |
|---|---|---|
| Unigram | No context (each word alone) | 962 |
| Bigram | One previous word | 170 |
| Trigram | Two previous words | 109 |
Interpretation:
- Unigram: Poor performance—words are guessed without context.
- Bigram: Big improvement—uses the previous word for prediction.
- Trigram: Even better—looks two words back.
As we increase context, our model becomes less perplexed.
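This effect is easy to reproduce in miniature. The sketch below (a toy corpus and add-one smoothing of our own choosing, not the WSJ setup above) builds a unigram and a bigram model and compares their perplexities on a held-out sentence; the bigram model, which uses one word of context, already comes out noticeably less perplexed:

```python
import math
from collections import Counter

# Toy stand-in for a real training/test split (not the WSJ data above).
train = "the cat sat on the mat . the dog sat on the rug .".split()
test  = "the cat sat on the rug .".split()

vocab = set(train) | set(test)
V = len(vocab)

uni_counts = Counter(train)
bi_counts  = Counter(zip(train, train[1:]))

def p_unigram(w):
    # Add-one (Laplace) smoothed unigram probability.
    return (uni_counts[w] + 1) / (len(train) + V)

def p_bigram(w, prev):
    # Add-one smoothed bigram probability P(w | prev).
    return (bi_counts[(prev, w)] + 1) / (uni_counts[prev] + V)

def perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))

uni_pp = perplexity([math.log(p_unigram(w)) for w in test])
# The bigram model only scores words that have a preceding word,
# so the first test token is skipped here for simplicity.
bi_pp = perplexity([math.log(p_bigram(w, prev)) for prev, w in zip(test, test[1:])])

print(f"unigram perplexity: {uni_pp:.1f}")   # ~7.1 on this toy data
print(f"bigram  perplexity: {bi_pp:.1f}")    # ~4.5: one word of context already helps
```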
5. How Perplexity is Used in Practice
Perplexity is often used to:
- Compare different language models
- Evaluate improvements after smoothing or interpolation
- Tune hyperparameters (e.g., how much to weight higher-order n-grams); see the sketch at the end of this section
- Select between different corpora or training setups
Because it's fast to compute and doesn't require any downstream task, it's perfect for rapid iteration and benchmarking.
However, it's important to remember:
Perplexity is not everything.
A model with lower perplexity might not always perform better in real-world applications (like speech recognition or translation). But it’s a strong indicator—especially when intrinsic evaluation is needed.
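As one concrete example of the hyperparameter tuning mentioned above, the weight used to interpolate a bigram model with a unigram model is typically chosen to minimize perplexity on a held-out set. Here is a self-contained sketch under the same toy setup as the earlier comparison (the held-out sentence and the grid of weights are arbitrary):

```python
import math
from collections import Counter

# Interpolated model: P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w).
train   = "the cat sat on the mat . the dog sat on the rug .".split()
heldout = "the dog sat on the mat .".split()

vocab = set(train) | set(heldout)
V = len(vocab)
uni_counts = Counter(train)
bi_counts  = Counter(zip(train, train[1:]))

def p_unigram(w):
    return (uni_counts[w] + 1) / (len(train) + V)

def p_bigram(w, prev):
    return (bi_counts[(prev, w)] + 1) / (uni_counts[prev] + V)

def heldout_perplexity(lam):
    log_probs = [math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
                 for prev, w in zip(heldout, heldout[1:])]
    return math.exp(-sum(log_probs) / len(log_probs))

# Grid search over interpolation weights: keep the one with the lowest held-out perplexity.
weights = [i / 10 for i in range(11)]
best = min(weights, key=heldout_perplexity)
print(f"best weight: {best}, held-out perplexity: {heldout_perplexity(best):.2f}")
```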
6. Perplexity and Sentence Length
Longer test sets naturally have lower total probability (you’re multiplying many numbers smaller than one), so we normalize by length, taking the N-th root, to ensure a fair comparison across test sets of different sizes.
That’s why we compute:

$$\mathrm{PP}(W) = P(w_1 \dots w_N)^{-\frac{1}{N}} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1 \dots w_{i-1})\right)$$
This gives us a per-word or per-token measure of difficulty.
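One practical note (an implementation detail, not spelled out in the post): for a test set of any realistic size, multiplying the probabilities directly underflows to zero in floating point, so perplexity is computed from the average log probability instead, matching the right-hand form of the formula above. A minimal sketch with placeholder probabilities:

```python
import math

def perplexity_from_probs(word_probs):
    """Numerically stable perplexity: exp of the negative mean log probability."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# 200,000 words at probability 1e-4 each: the raw product would underflow to 0.0,
# but the log-space computation is unaffected.
probs = [1e-4] * 200_000
print(perplexity_from_probs(probs))   # 10000.0
```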
7. When Perplexity Goes Wrong
Perplexity can be misleading when:
- Test data leaks into training (data contamination)
- Models are evaluated with different vocabularies
- Models have different tokenization schemes (e.g., subwords vs. words)
To get valid perplexity comparisons, ensure:
- The test set is unseen and separate
- The vocabulary is consistent
- The tokenization method is fixed
8. Summary Table: What Perplexity Tells You
| Perplexity | Model Insight |
|---|---|
| ~1 | Perfect model (knows all test words exactly) |
| 50–100 | Typical range for good n-gram models |
| 100–1000 | Poor model or insufficient training |
| Vocab size | Like guessing randomly from all words |
Conclusion
Perplexity is a powerful and interpretable metric that tells us how good a language model is at predicting real language.
It helps researchers:
- Spot overfitting
- Measure progress
- Compare models
- Tune hyperparameters
While newer language models like BERT and GPT are evaluated with other metrics too, perplexity remains a foundational tool—especially for n-gram models and intrinsic evaluation.
So the next time you train a language model, don’t just ask: Does it work?
Ask: How perplexed is it?