Saturday, 15 November 2025

Entropy, Cross-Entropy, and Perplexity: The Deep Math Behind Language Models

Most people working with language models have heard of perplexity—the go-to metric for evaluating how well a model predicts text.

But underneath perplexity lies something deeper: the mathematical foundations of information theory. Concepts like entropy and cross-entropy explain not just how perplexity is computed, but why it works—and what it really measures.

In this blog post, we’ll:

  • Explain entropy using simple examples (like betting on horse races),

  • Show how cross-entropy approximates it when we don’t know the real distribution, and

  • Connect it all to perplexity as the exponential of the negative average log probability.

If you’ve ever wondered why perplexity is defined the way it is, this post is for you.

1. What Is Entropy? The Core Idea of Information

In information theory (Shannon, 1948), entropy measures uncertainty or surprise in a probability distribution.

The higher the entropy, the more uncertain or random the outcome.
The lower the entropy, the more predictable it is.

⚖️ Analogy: Horse Race Guess

Let’s say 8 horses are running in a race. You want to bet on the winner.

🎲 Case A: All horses are equally likely

Horse   Probability
1       1/8
2       1/8
...     ...
8       1/8
  • You have no idea who will win.

  • The outcome is maximally uncertain.

  • Entropy = 3 bits

H(X) = - \sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3

You’d need 3 bits to encode the winner.

🏇 Case B: One horse is favored

Horse   Probability
1       0.5
2       0.25
3–8     0.0417 each
  • The outcome is more predictable.

  • Entropy drops below 3 bits (to roughly 2.15 bits here)—less uncertainty.

Entropy tells us the minimum average number of bits needed to encode outcomes in this distribution.
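
A minimal Python sketch confirms both numbers (the helper name entropy_bits is just for illustration; the probabilities are the ones from the two tables above):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H(X) = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case A: all 8 horses equally likely
print(entropy_bits([1 / 8] * 8))                 # 3.0 bits

# Case B: one favorite (0.5), one contender (0.25), six long shots (1/24 ≈ 0.0417 each)
print(entropy_bits([0.5, 0.25] + [1 / 24] * 6))  # ~2.15 bits
```

Concentrating probability on a favorite is exactly what drives the entropy down.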

2. Entropy in Language

In language, entropy measures the average unpredictability of the next word.

  • If your model is confident about what comes next (e.g., "I want to ___" → "eat"), entropy is low.

  • If many words are equally likely, entropy is high.

Formally:

H(P) = - \sum_{w} P(w) \log_2 P(w)

Where:

  • P(w) is the probability of word w

  • The sum is over all possible words
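
As a toy worked example of this formula (the three probabilities are invented purely for illustration), suppose that after "I want to" the model considers only "eat" (0.7), "sleep" (0.2), and "go" (0.1). Then:

H(P) = -(0.7 \log_2 0.7 + 0.2 \log_2 0.2 + 0.1 \log_2 0.1) \approx 1.16 \text{ bits}

That is far less than a uniform guess over the full vocabulary would cost, which is exactly what context buys us.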

3. Cross-Entropy: When You Don’t Know the Truth

But wait—how do we calculate entropy if we don’t know the true distribution P of language?

That’s where cross-entropy comes in.

Suppose:

  • P = the true distribution (ideal, unknown)

  • Q = your model's predicted distribution

Then:

H(P, Q) = - \sum_{w} P(w) \log_2 Q(w)

Cross-entropy answers:

“On average, how many bits does my model Q use to encode samples from the true distribution P?”

If Q = P, cross-entropy = entropy (perfect model).
If Q is wrong, cross-entropy > entropy.
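
A small Python sketch makes the gap concrete (P and Q here are toy next-word distributions invented for the example):

```python
import math

def entropy_bits(p):
    """H(P) = -sum over w of P(w) * log2 P(w)."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum over w of P(w) * log2 Q(w)."""
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)

P = {"eat": 0.7, "sleep": 0.2, "go": 0.1}   # "true" distribution (toy)
Q = {"eat": 0.4, "sleep": 0.3, "go": 0.3}   # a model that spreads its mass too evenly

print(entropy_bits(P))           # ~1.16 bits
print(cross_entropy_bits(P, Q))  # ~1.45 bits, always >= H(P)
```

The extra ~0.29 bits per word is the price the model pays for putting its probability mass in the wrong places.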

4. Connecting to Perplexity

Now let’s bring in perplexity.

Perplexity is simply:

\text{Perplexity}(W) = 2^{H(P, Q)}

Or, for a word sequence W of length N:

\text{Perplexity}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid \text{context})}

Perplexity is the exponential of the negative average log probability the model assigns to the test sequence; the exponent is an estimate of the cross-entropy H(P, Q), with the test words standing in for samples from the true distribution P. It measures:

  • The average number of equally likely next-word choices at each step (the “branching factor”).

  • How “confused” your model is when making predictions.

Lower cross-entropy → lower perplexity → better model.
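
Here is that formula as a short Python sketch; the per-word probabilities are invented numbers a model Q might assign to each word of a six-word test sentence given its context:

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (-(1/N) * sum of log2 Q(w_i | context))."""
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log2

# Q(w_i | context) for each word of a six-word test sentence (illustrative only)
probs = [0.2, 0.1, 0.15, 0.25, 0.3, 0.2]
print(perplexity(probs))   # ~5.3: as if choosing among roughly 5 equally likely words per step
```

Equivalently, perplexity is the inverse of the geometric mean of the per-word probabilities, which is why it reads as an effective branching factor.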

5. Visualization: Entropy, Cross-Entropy, and Perplexity

Let’s compare three scenarios using a test sentence:

“The cat sat on the mat”

Model                   Cross-Entropy (bits/word)   Perplexity
Perfect model (Q = P)   2                           2^2 = 4
Good model              3                           2^3 = 8
Weak model              5                           2^5 = 32
  • The perfect model is the least “perplexed”—it predicts the next word confidently.

  • The weak model is unsure—it effectively considers more possible words at each step.

6. Why Does This Matter for Language Modeling?

Because we can’t directly access the true entropy of language (we don’t know P), we use:

✅ A test set (as a sample from P)
✅ Our model Q to compute probabilities

Then, we compute cross-entropy and perplexity to measure how well Q fits P.

If your perplexity is high, your model isn’t capturing the structure of the language well.
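
Concretely, the evaluation loop looks something like the sketch below: a toy add-one-smoothed unigram model Q is trained on a tiny corpus and then scored on a held-out sentence. The corpus, the smoothing choice, and the function names are all invented for illustration; real evaluations use far larger test sets and much stronger models.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model Q, so no test word gets probability 0."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def cross_entropy_bits(q, test_tokens):
    """Empirical cross-entropy: -(1/N) * sum of log2 Q(w_i) over the test words."""
    return -sum(math.log2(q[w]) for w in test_tokens) / len(test_tokens)

train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()
vocab = set(train) | set(test)

q = train_unigram(train, vocab)
h = cross_entropy_bits(q, test)
print(h)        # cross-entropy in bits per word
print(2 ** h)   # perplexity of the test sentence under Q
```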

7. Recap: How It All Connects

Concept                  What It Measures                     Unit                 Relationship
Entropy H(P)             True uncertainty of a distribution   bits                 Lower = more predictable
Cross-Entropy H(P, Q)    Average encoding cost with model Q   bits                 ≥ Entropy
Perplexity               Exponential of cross-entropy         "Branching factor"   2^{H(P, Q)}

8. Bonus: Cross-Entropy Loss in Deep Learning

In deep learning (e.g., training BERT or GPT), we often minimize:

\text{Cross-Entropy Loss} = -\sum_{w} P(w) \log Q(w)

This is exactly the same idea—just applied to one-hot target vectors and predicted softmax outputs (frameworks typically use the natural logarithm, so the loss is measured in nats rather than bits). So:

If you’re minimizing cross-entropy loss during training, you’re really minimizing expected code length and perplexity.
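
Here is a minimal NumPy sketch of that connection. The logits, vocabulary size, and target index are made up, and real frameworks fuse the softmax and the log for numerical stability rather than computing them separately as done here:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution Q over the vocabulary."""
    z = logits - logits.max()        # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

def cross_entropy_loss(logits, target_index):
    """-sum(P(w) * log Q(w)) with a one-hot P reduces to -log Q(target)."""
    q = softmax(logits)
    return -np.log(q[target_index])  # natural log, so the loss is in nats

logits = np.array([2.0, 0.5, -1.0, 0.1])  # model scores for a toy 4-word vocabulary
target = 0                                # index of the true next word
loss = cross_entropy_loss(logits, target)
print(loss)              # per-token loss in nats
print(loss / np.log(2))  # the same value in bits
```

Averaged over a dataset and exponentiated, this per-token loss is the model's perplexity (base e if you keep the natural log, base 2 if you convert to bits).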

Conclusion

Behind every perplexity score lies a story of information—the bits, the guesses, the uncertainty.

  • Entropy tells us how unpredictable language is.

  • Cross-entropy tells us how well our model predicts it.

  • Perplexity translates all that into a simple score we can optimize.

So the next time you measure a model’s perplexity, remember: you're not just crunching numbers—you’re measuring how well your model understands the fundamental structure of human language.
