Saturday, 15 November 2025

Entropy, Cross-Entropy, and Perplexity: The Deep Math Behind Language Models

Most people working with language models have heard of perplexity—the go-to metric for evaluating how well a model predicts text.

But underneath perplexity lies something deeper: the mathematical foundations of information theory. Concepts like entropy and cross-entropy explain not just how perplexity is computed, but why it works—and what it really measures.

In this blog post, we’ll:

  • Explain entropy using simple examples (like betting on horse races),

  • Show how cross-entropy approximates it when we don’t know the real distribution, and

  • Connect it all to perplexity as the exponential of the negative average log probability.

If you’ve ever wondered why perplexity is defined the way it is, this post is for you.

1. What Is Entropy? The Core Idea of Information

In information theory (Shannon, 1948), entropy measures uncertainty or surprise in a probability distribution.

The higher the entropy, the more uncertain or random the outcome.
The lower the entropy, the more predictable it is.

⚖️ Analogy: Horse Race Guess

Let’s say 8 horses are running in a race. You want to bet on the winner.

🎲 Case A: All horses are equally likely

Horse   Probability
1       1/8
2       1/8
...     ...
8       1/8
  • You have no idea who will win.

  • The outcome is maximally uncertain.

  • Entropy = 3 bits

H(X) = - \sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3

You’d need 3 bits to encode the winner.

🏇 Case B: One horse is favored

Horse   Probability
1       0.5
2       0.25
3–8     0.0417 each
  • The outcome is more predictable.

  • Entropy drops below 3 bits (to roughly 2.15 bits here)—less uncertainty.

Entropy tells us the minimum average number of bits needed to encode outcomes in this distribution.
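
A minimal Python sketch confirms both numbers (the helper name entropy_bits is just for illustration; the probabilities are the ones from the two tables above):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H(X) = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case A: all 8 horses equally likely
print(entropy_bits([1 / 8] * 8))                 # 3.0 bits

# Case B: one favorite (0.5), one contender (0.25), six long shots (1/24 ≈ 0.0417 each)
print(entropy_bits([0.5, 0.25] + [1 / 24] * 6))  # ~2.15 bits
```

Concentrating probability on a favorite is exactly what drives the entropy down.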

2. Entropy in Language

In language, entropy measures the average unpredictability of the next word.

  • If your model is confident about what comes next (e.g., "I want to ___" → "eat"), entropy is low.

  • If many words are equally likely, entropy is high.

Formally:

H(P) = - \sum_{w} P(w) \log_2 P(w)

Where:

  • P(w) is the probability of word w

  • The sum is over all possible words
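
As a toy worked example of this formula (the three probabilities are invented purely for illustration), suppose that after "I want to" the model considers only "eat" (0.7), "sleep" (0.2), and "go" (0.1). Then:

H(P) = -(0.7 \log_2 0.7 + 0.2 \log_2 0.2 + 0.1 \log_2 0.1) \approx 1.16 \text{ bits}

That is far less than a uniform guess over the full vocabulary would cost, which is exactly what context buys us.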

3. Cross-Entropy: When You Don’t Know the Truth

But wait—how do we calculate entropy if we don’t know the true distribution P of language?

That’s where cross-entropy comes in.

Suppose:

  • P = the true distribution (ideal, unknown)

  • Q = your model's predicted distribution

Then:

H(P, Q) = - \sum_{w} P(w) \log_2 Q(w)

Cross-entropy answers:

“On average, how many bits does my model Q use to encode samples from the true distribution P?”

If Q = P, cross-entropy = entropy (perfect model).
If Q is wrong, cross-entropy > entropy.
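
A small Python sketch makes the gap concrete (P and Q here are toy next-word distributions invented for the example):

```python
import math

def entropy_bits(p):
    """H(P) = -sum over w of P(w) * log2 P(w)."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum over w of P(w) * log2 Q(w)."""
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)

P = {"eat": 0.7, "sleep": 0.2, "go": 0.1}   # "true" distribution (toy)
Q = {"eat": 0.4, "sleep": 0.3, "go": 0.3}   # a model that spreads its mass too evenly

print(entropy_bits(P))           # ~1.16 bits
print(cross_entropy_bits(P, Q))  # ~1.45 bits, always >= H(P)
```

The extra ~0.29 bits per word is the price the model pays for putting its probability mass in the wrong places.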

4. Connecting to Perplexity

Now let’s bring in perplexity.

Perplexity is simply:

\text{Perplexity}(W) = 2^{H(P, Q)}

Or, for a word sequence W of length N:

\text{Perplexity}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid \text{context})}

Perplexity is the exponential of the negative average log probability the model assigns to the test sequence; the exponent is an estimate of the cross-entropy H(P, Q), with the test words standing in for samples from the true distribution P. It measures:

  • The average number of equally likely next-word choices at each step (the “branching factor”).

  • How “confused” your model is when making predictions.

Lower cross-entropy → lower perplexity → better model.
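
Here is that formula as a short Python sketch; the per-word probabilities are invented numbers a model Q might assign to each word of a six-word test sentence given its context:

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (-(1/N) * sum of log2 Q(w_i | context))."""
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log2

# Q(w_i | context) for each word of a six-word test sentence (illustrative only)
probs = [0.2, 0.1, 0.15, 0.25, 0.3, 0.2]
print(perplexity(probs))   # ~5.3: as if choosing among roughly 5 equally likely words per step
```

Equivalently, perplexity is the inverse of the geometric mean of the per-word probabilities, which is why it reads as an effective branching factor.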

5. Visualization: Entropy, Cross-Entropy, and Perplexity

Let’s compare three scenarios using a test sentence:

“The cat sat on the mat”

Model                   Cross-Entropy (bits/word)   Perplexity
Perfect model (Q = P)   2                           2^2 = 4
Good model              3                           2^3 = 8
Weak model              5                           2^5 = 32
  • The perfect model is the least “perplexed”—it predicts the next word confidently.

  • The weak model is unsure—it effectively considers more possible words at each step.

6. Why Does This Matter for Language Modeling?

Because we can’t directly access the true entropy of language (we don’t know P), we use:

✅ A test set (as a sample from P)
✅ Our model Q to compute probabilities

Then, we compute cross-entropy and perplexity to measure how well Q fits P.

If your perplexity is high, your model isn’t capturing the structure of the language well.
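
Concretely, the evaluation loop looks something like the sketch below: a toy add-one-smoothed unigram model Q is trained on a tiny corpus and then scored on a held-out sentence. The corpus, the smoothing choice, and the function names are all invented for illustration; real evaluations use far larger test sets and much stronger models.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model Q, so no test word gets probability 0."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def cross_entropy_bits(q, test_tokens):
    """Empirical cross-entropy: -(1/N) * sum of log2 Q(w_i) over the test words."""
    return -sum(math.log2(q[w]) for w in test_tokens) / len(test_tokens)

train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()
vocab = set(train) | set(test)

q = train_unigram(train, vocab)
h = cross_entropy_bits(q, test)
print(h)        # cross-entropy in bits per word
print(2 ** h)   # perplexity of the test sentence under Q
```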

7. Recap: How It All Connects

Concept                  What It Measures                     Unit                 Relationship
Entropy H(P)             True uncertainty of a distribution   bits                 Lower = more predictable
Cross-Entropy H(P, Q)    Average encoding cost with model Q   bits                 ≥ Entropy
Perplexity               Exponential of cross-entropy         "Branching factor"   2^{H(P, Q)}

8. Bonus: Cross-Entropy Loss in Deep Learning

In deep learning (e.g., training BERT or GPT), we often minimize:

\text{Cross-Entropy Loss} = -\sum_{w} P(w) \log Q(w)

This is exactly the same idea—just applied to one-hot target vectors and predicted softmax outputs (frameworks typically use the natural logarithm, so the loss is measured in nats rather than bits). So:

If you’re minimizing cross-entropy loss during training, you’re really minimizing expected code length and perplexity.
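
Here is a minimal NumPy sketch of that connection. The logits, vocabulary size, and target index are made up, and real frameworks fuse the softmax and the log for numerical stability rather than computing them separately as done here:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution Q over the vocabulary."""
    z = logits - logits.max()        # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

def cross_entropy_loss(logits, target_index):
    """-sum(P(w) * log Q(w)) with a one-hot P reduces to -log Q(target)."""
    q = softmax(logits)
    return -np.log(q[target_index])  # natural log, so the loss is in nats

logits = np.array([2.0, 0.5, -1.0, 0.1])  # model scores for a toy 4-word vocabulary
target = 0                                # index of the true next word
loss = cross_entropy_loss(logits, target)
print(loss)              # per-token loss in nats
print(loss / np.log(2))  # the same value in bits
```

Averaged over a dataset and exponentiated, this per-token loss is the model's perplexity (base e if you keep the natural log, base 2 if you convert to bits).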

Conclusion

Behind every perplexity score lies a story of information—the bits, the guesses, the uncertainty.

  • Entropy tells us how unpredictable language is.

  • Cross-entropy tells us how well our model predicts it.

  • Perplexity translates all that into a simple score we can optimize.

So the next time you measure a model’s perplexity, remember: you're not just crunching numbers—you’re measuring how well your model understands the fundamental structure of human language.
