Most people working with language models have heard of perplexity—the go-to metric for evaluating how well a model predicts text.
But underneath perplexity lies something deeper: the mathematical foundations of information theory. Concepts like entropy and cross-entropy explain not just how perplexity is computed, but why it works—and what it really measures.
In this blog post, we’ll:
- Explain entropy using simple examples (like betting on horse races),
- Show how cross-entropy approximates it when we don't know the true distribution, and
- Connect it all to perplexity as the exponential of the negative average log probability.
If you’ve ever wondered why perplexity is defined the way it is, this post is for you.
1. What Is Entropy? The Core Idea of Information
In information theory (Shannon, 1948), entropy measures uncertainty or surprise in a probability distribution.
The higher the entropy, the more uncertain or random the outcome.
The lower the entropy, the more predictable it is.
⚖️ Analogy: Betting on a Horse Race
Let's say 8 horses are running in a race, and you want to bet on the winner.
Case A: All horses are equally likely
| Horse | Probability |
|---|---|
| 1 | 1/8 |
| 2 | 1/8 |
| ... | ... |
| 8 | 1/8 |
- You have no idea who will win.
- The outcome is maximally uncertain.
- Entropy = 3 bits (log₂ 8 = 3): you'd need 3 bits to encode the winner.
Case B: One horse is favored
| Horse | Probability |
|---|---|
| 1 | 0.5 |
| 2 | 0.25 |
| 3–8 | 1/24 ≈ 0.0417 each |
- The outcome is more predictable.
- Entropy drops below 3 bits (about 2.15 bits here), meaning less uncertainty.
Entropy tells us the minimum average number of bits needed to encode outcomes in this distribution.
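To make this concrete, here is a minimal Python sketch (not from the original post) that computes the entropy of both horse-race distributions:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(P) = -sum_w P(w) * log2 P(w)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case A: all 8 horses equally likely
print(entropy([1 / 8] * 8))                     # 3.0 bits

# Case B: one strong favourite (probabilities from the table above)
print(entropy([1 / 2, 1 / 4] + [1 / 24] * 6))   # ~2.15 bits
```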
2. Entropy in Language
In language, entropy measures the average unpredictability of the next word.
- If your model is confident about what comes next (e.g., "I want to ___" → "eat"), entropy is low.
- If many words are equally likely, entropy is high.

Formally:

$$H(P) = -\sum_{w} P(w)\,\log_2 P(w)$$

Where:

- $P(w)$ is the probability of word $w$,
- The sum is over all possible words $w$.
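As a quick illustration (the candidate words and probabilities below are invented, not from the post), a confident next-word distribution has much lower entropy than a flat one:

```python
import math

def entropy(next_word_probs):
    """Entropy (bits) of a next-word distribution given as {word: probability}."""
    return -sum(p * math.log2(p) for p in next_word_probs.values() if p > 0)

# "I want to ___": a confident model puts most of its mass on one continuation
confident = {"eat": 0.9, "go": 0.05, "sleep": 0.05}
# An uncertain model spreads its mass over many continuations
uncertain = {"eat": 0.25, "go": 0.25, "sleep": 0.25, "dance": 0.25}

print(f"confident: {entropy(confident):.2f} bits")   # ~0.57 bits
print(f"uncertain: {entropy(uncertain):.2f} bits")   # 2.00 bits
```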
3. Cross-Entropy: When You Don’t Know the Truth
But wait—how do we calculate entropy if we don’t know the true distribution of language?
That’s where cross-entropy comes in.
Suppose:

- $P$ = the true distribution (ideal, unknown),
- $Q$ = your model's predicted distribution.

Then:

$$H(P, Q) = -\sum_{w} P(w)\,\log_2 Q(w)$$

Cross-entropy answers:

"On average, how many bits does my model $Q$ use to encode samples from the true distribution $P$?"

If $Q = P$, cross-entropy = entropy (perfect model).
If $Q$ is wrong, cross-entropy > entropy.
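Here is a small sketch (toy numbers, not from the post) comparing entropy and cross-entropy on a made-up four-word distribution:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_w P(w) * log2 Q(w): bits model Q spends encoding samples from P."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.5, 0.25, 0.125, 0.125]   # "true" distribution P (toy)
q_good = [0.4, 0.3, 0.15, 0.15]      # a reasonably calibrated model Q
q_bad  = [0.1, 0.1, 0.4, 0.4]        # a badly miscalibrated model

print(f"H(P)         = {entropy(p_true):.3f} bits")                # 1.750
print(f"H(P, P)      = {cross_entropy(p_true, p_true):.3f} bits")  # equals H(P)
print(f"H(P, Q_good) = {cross_entropy(p_true, q_good):.3f} bits")  # slightly above H(P)
print(f"H(P, Q_bad)  = {cross_entropy(p_true, q_bad):.3f} bits")   # much higher
```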
4. Connecting to Perplexity
Now let’s bring in perplexity.
Perplexity is simply:

$$\text{Perplexity} = 2^{H(P, Q)}$$

Or, for a word sequence $W = w_1 w_2 \ldots w_N$ of length $N$:

$$\text{PP}(W) = Q(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid w_1, \ldots, w_{i-1})}$$

Perplexity exponentiates the negative average log-probability of the test sequence; equivalently, it is the inverse probability of the sequence, normalized by its length. It measures:

- The effective number of equally likely next words at each step (the "branching factor"),
- How "confused" your model is when making predictions.
Lower cross-entropy → lower perplexity → better model.
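A minimal sketch of that formula, assuming we already have the per-word probabilities a model assigned to a test sentence (the numbers are invented for illustration):

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log2

# Hypothetical probabilities a model assigns to "The cat sat on the mat"
probs = [0.2, 0.1, 0.25, 0.5, 0.3, 0.15]
print(f"perplexity: {perplexity(probs):.2f}")

# Sanity check: if every word gets probability 1/8, perplexity is exactly 8
print(perplexity([1 / 8] * 6))   # 8.0
```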
5. Visualization: Entropy, Cross-Entropy, and Perplexity
Let’s compare three scenarios using a test sentence:
“The cat sat on the mat”
| Model | Cross-Entropy (bits/word) | Perplexity |
|---|---|---|
| Perfect model (Q = P) | 2 | 4 |
| Good model | 3 | 8 |
| Weak model | 5 | 32 |
- The perfect model is less "perplexed": it knows the next word confidently.
- The weak model is unsure: it considers more possible words per step.
6. Why Does This Matter for Language Modeling?
Because we can't directly access the true entropy of language (we don't know $P$), we use:

✅ A test set (as a sample from $P$)
✅ Our model $Q$ to compute probabilities

Then, we compute cross-entropy and perplexity to measure how well $Q$ fits $P$.
If your perplexity is high, your model isn’t capturing the structure of the language well.
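To see the whole workflow end to end, here is a toy sketch (the corpus, add-one smoothing, and unigram model are all invented for illustration, and far simpler than a real language model):

```python
import math
from collections import Counter

# A tiny "training corpus" and a held-out test sentence
train_tokens = "the cat sat on the mat the dog sat on the rug".split()
test_tokens = "the cat sat on the rug".split()

# Fit a unigram model Q with add-one smoothing so no test word gets probability 0
vocab = set(train_tokens) | set(test_tokens)
counts = Counter(train_tokens)
total = len(train_tokens) + len(vocab)
q = {w: (counts[w] + 1) / total for w in vocab}

# Estimate cross-entropy as the average negative log2 Q over the test tokens,
# then exponentiate to get perplexity
cross_entropy = -sum(math.log2(q[w]) for w in test_tokens) / len(test_tokens)
print(f"estimated cross-entropy: {cross_entropy:.2f} bits/word")
print(f"perplexity: {2 ** cross_entropy:.2f}")
```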
7. Recap: How It All Connects
| Concept | What It Measures | Unit | Relationship |
|---|---|---|---|
| Entropy | True uncertainty of a distribution | bits | Lower = more predictable |
| Cross-Entropy | Average encoding cost with model $Q$ | bits | ≥ Entropy |
| Perplexity | Exponential of cross-entropy | "Branching factor" | Lower = better model |
8. Bonus: Cross-Entropy Loss in Deep Learning
In deep learning (e.g., training BERT or GPT), we often minimize the cross-entropy loss:

$$\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i$$

This is exactly the same idea, just applied to one-hot target vectors $y$ and predicted softmax outputs $\hat{y}$. So:
If you’re minimizing cross-entropy loss during training, you’re really minimizing expected code length and perplexity.
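A bare-bones sketch of that loss in plain Python (the logits and target are made up; real frameworks compute the same quantity far more efficiently). Note that deep-learning losses usually use the natural log, so the matching perplexity is exp(loss) rather than 2^loss:

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(one_hot_target, predicted_probs):
    """L = -sum_i y_i * log(y_hat_i); with a one-hot target this reduces to
    -log of the probability assigned to the correct class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_target, predicted_probs) if y > 0)

# Toy next-token prediction over a 4-token vocabulary
logits = [2.0, 0.5, 0.1, -1.0]   # raw model outputs
target = [1, 0, 0, 0]            # one-hot: the first token is the correct one

loss = cross_entropy_loss(target, softmax(logits))
print(f"loss (nats): {loss:.3f}")
print(f"per-token perplexity: {math.exp(loss):.3f}")   # averaged over a corpus in practice
```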
Conclusion
Behind every perplexity score lies a story of information—the bits, the guesses, the uncertainty.
- Entropy tells us how unpredictable language is.
- Cross-entropy tells us how well our model predicts it.
- Perplexity translates all that into a simple score we can optimize.
So the next time you measure a model’s perplexity, remember: you're not just crunching numbers—you’re measuring how well your model understands the fundamental structure of human language.