Saturday, 15 November 2025

Entropy, Cross-Entropy, and Perplexity: The Deep Math Behind Language Models

 Most people working with language models have heard of perplexity—the go-to metric for evaluating how well a model predicts text.

But underneath perplexity lies something deeper: the mathematical foundations of information theory. Concepts like entropy and cross-entropy explain not just how perplexity is computed, but why it works—and what it really measures.

In this blog post, we’ll:

  • Explain entropy using simple examples (like betting on horse races),

  • Show how cross-entropy approximates it when we don’t know the real distribution, and

  • Connect it all to perplexity, the exponential of the negative average log probability.

If you’ve ever wondered why perplexity is defined the way it is, this post is for you.

1. What Is Entropy? The Core Idea of Information

In information theory (Shannon, 1948), entropy measures uncertainty or surprise in a probability distribution.

The higher the entropy, the more uncertain or random the outcome.
The lower the entropy, the more predictable it is.

⚖️ Analogy: Horse Race Guess

Let’s say 8 horses are running in a race. You want to bet on the winner.

Case A: All horses are equally likely

| Horse | Probability |
| --- | --- |
| 1 | 1/8 |
| 2 | 1/8 |
| ... | ... |
| 8 | 1/8 |

  • You have no idea who will win.

  • The outcome is maximally uncertain.

  • Entropy = 3 bits

H(X) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3

You’d need 3 bits to encode the winner.

Case B: One horse is favored

| Horse | Probability |
| --- | --- |
| 1 | 0.5 |
| 2 | 0.25 |
| 3–8 | 0.0417 each |

  • The outcome is more predictable.

  • Entropy drops below 3 bits—less uncertainty.

Entropy tells us the minimum average number of bits needed to encode outcomes in this distribution.
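
To make this concrete, here is a minimal Python sketch (the `entropy` helper is our own illustrative name, not from any particular library) that reproduces both cases:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case A: 8 equally likely horses -> maximal uncertainty
print(entropy([1/8] * 8))                     # 3.0 bits

# Case B: one strong favourite -> less uncertainty
print(entropy([0.5, 0.25] + [0.25 / 6] * 6))  # ~2.15 bits, below 3
```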

2. Entropy in Language

In language, entropy measures the average unpredictability of the next word.

  • If your model is confident what comes next (e.g., "I want to ___" → "eat"), entropy is low.

  • If many words are equally likely, entropy is high.

Formally:

H(P) = - \sum_{w} P(w) \log_2 P(w)

Where:

  • P(w) is the probability of word w

  • The sum is over all possible words
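
As a tiny illustration (toy numbers, not from any real corpus), compare a confident next-word distribution with a diffuse one:

```python
import math

def entropy(probs):
    # Shannon entropy in bits of a next-word distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = {"eat": 0.9, "go": 0.05, "sleep": 0.05}                # "I want to ___"
diffuse   = {"eat": 0.25, "go": 0.25, "sleep": 0.25, "sing": 0.25}

print(entropy(confident.values()))  # ~0.57 bits: low, predictable
print(entropy(diffuse.values()))    # 2.0 bits: high, uncertain
```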

3. Cross-Entropy: When You Don’t Know the Truth

But wait: how do we calculate entropy if we don’t know the true distribution P of language?

That’s where cross-entropy comes in.

Suppose:

  • P = the true distribution (ideal, unknown)

  • Q = your model's predicted distribution

Then:

H(P, Q) = - \sum_{w} P(w) \log_2 Q(w)

Cross-entropy answers:

“On average, how many bits does my model Q use to encode samples from the true distribution P?”

If Q = P, cross-entropy = entropy (perfect model).
If Q is wrong, cross-entropy > entropy.
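
A quick sketch of that relationship (the distributions here are made up purely for illustration):

```python
import math

def cross_entropy(p, q):
    # H(P, Q) = -sum_w P(w) * log2 Q(w), in bits per word
    return -sum(p[w] * math.log2(q[w]) for w in p if p[w] > 0)

P = {"eat": 0.7, "go": 0.2, "sleep": 0.1}   # hypothetical "true" distribution
Q = {"eat": 0.5, "go": 0.3, "sleep": 0.2}   # a model's imperfect estimate

print(cross_entropy(P, P))  # ~1.16 bits: H(P, P) is just the entropy H(P)
print(cross_entropy(P, Q))  # ~1.28 bits: always >= H(P) when Q differs from P
```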

4. Connecting to Perplexity

Now let’s bring in perplexity.

Perplexity is simply:

\text{Perplexity}(W) = 2^{H(P, Q)}

Or for a word sequence of length N:

\text{Perplexity}(W) = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid \text{context}) }

Perplexity is the exponential of the negative average log-probability of the test sequence. It measures:

  • The effective number of equally likely next words at each step (the branching factor).

  • How “confused” your model is when making predictions.

Lower cross-entropy → lower perplexity → better model.
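
In code, the sequence form is just an exponentiated average. Here is a minimal sketch, assuming we already have the probability the model assigned to each test word given its context:

```python
import math

def perplexity(token_probs):
    # 2 ** ( -(1/N) * sum_i log2 Q(w_i | context) )
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)

# Hypothetical per-word probabilities for a 6-word test sentence
probs = [0.2, 0.1, 0.25, 0.05, 0.3, 0.15]
print(perplexity(probs))  # ~6.7: on average, about 7 "equally likely" next words per step
```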

5. Visualization: Entropy, Cross-Entropy, and Perplexity

Let’s compare three scenarios using a test sentence:

“The cat sat on the mat”

| Model | Cross-Entropy (bits/word) | Perplexity |
| --- | --- | --- |
| Perfect model (Q = P) | 2 | 2^2 = 4 |
| Good model | 3 | 8 |
| Weak model | 5 | 32 |

  • The perfect model is less “perplexed”—knows the next word confidently.

  • The weak model is unsure—considers more possible words per step.

6. Why Does This Matter for Language Modeling?

Because we can’t directly access the true entropy of language (we don’t know P), we use:

✅ A test set (as a sample from P)
✅ Our model Q to compute probabilities

Then, we compute cross-entropy and perplexity to measure how well Q fits P.

If your perplexity is high, your model isn’t capturing the structure of the language well.

7. Recap: How It All Connects

| Concept | What It Measures | Unit | Relationship |
| --- | --- | --- | --- |
| Entropy H(P) | True uncertainty of a distribution | bits | Lower = more predictable |
| Cross-Entropy H(P, Q) | Average encoding cost with model Q | bits | ≥ Entropy |
| Perplexity | Exponential of cross-entropy | "Branching factor" | 2^{H(P,Q)} |

8. Bonus: Cross-Entropy Loss in Deep Learning

In deep learning (e.g., training BERT or GPT), we often minimize:

\text{Cross-Entropy Loss} = -\sum_{w} P(w) \log Q(w)

This is exactly the same idea—just applied to one-hot target vectors and predicted softmax outputs. So:

If you’re minimizing cross-entropy loss during training, you’re really minimizing expected code length and perplexity.
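
Here is that loss spelled out with NumPy for a single prediction over a toy 4-word vocabulary (frameworks such as PyTorch compute the same quantity, typically in natural-log units, i.e. nats):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 0.5, 3.0, -1.0])  # raw model scores over a toy vocabulary
target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot "true" distribution P(w)

q = softmax(logits)                        # predicted distribution Q(w)
loss = -np.sum(target * np.log(q))         # -sum_w P(w) log Q(w)
print(loss)                                # lower loss = shorter expected code length
```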

Conclusion

Behind every perplexity score lies a story of information—the bits, the guesses, the uncertainty.

  • Entropy tells us how unpredictable language is.

  • Cross-entropy tells us how well our model predicts it.

  • Perplexity translates all that into a simple score we can optimize.

So the next time you measure a model’s perplexity, remember: you're not just crunching numbers—you’re measuring how well your model understands the fundamental structure of human language.


The Magic of Smoothing: Solving the Zero-Probability Problem

 Let’s say you’ve built a language model, trained it on thousands of sentences, and you're ready to test it. But then… disaster:

Your model assigns a probability of zero to a perfectly valid sentence.

Why?

Because your model has never seen that exact word combination before. In the world of n-gram models, this is known as the zero-probability problem, and it can cripple your model’s ability to generalize.

The solution? Smoothing.

In this post, we’ll explore why smoothing is essential, and break down four classic smoothing techniques:
✔️ Add-one (Laplace)
✔️ Add-k
✔️ Interpolation
✔️ Stupid backoff

We'll also walk through formulas and give you intuitive ways to visualize what each method is doing.

1. The Zero-Probability Problem

Imagine you train a bigram model on the following corpus:

<s> I like pizza </s>
<s> I eat pizza </s>
<s> I want tacos </s>

Then someone types:

“I like tacos”

Uh-oh.

The model has:

  • Seen “I like”

  • Seen “like pizza”

  • Seen “want tacos”

But never seen “like tacos”. So it assigns:

P(\text{"tacos"} \mid \text{"like"}) = 0

Which means:

  • The sentence "I like tacos" gets zero total probability

  • Perplexity becomes undefined (division by zero)

  • The model fails to recognize a totally valid combination

This is where smoothing comes in—to assign a small but nonzero probability to unseen events.
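
You can see the failure in a few lines of Python. This is a toy maximum-likelihood bigram model over the corpus above (the helper names are ours, just for illustration):

```python
from collections import Counter

corpus = "<s> I like pizza </s> <s> I eat pizza </s> <s> I want tacos </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def mle_bigram(w_prev, w):
    # Maximum-likelihood estimate: C(w_prev w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(mle_bigram("like", "pizza"))  # 1.0  -> seen in training
print(mle_bigram("like", "tacos"))  # 0.0  -> zeroes out the whole sentence
```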

2. Add-One Smoothing (Laplace)

The most basic smoothing technique is add-one smoothing, also known as Laplace smoothing.

Formula:

P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}

Where:

  • C(w_{n-1} w_n): count of the bigram

  • C(w_{n-1}): count of the context word

  • V: vocabulary size

This method:

  • Adds 1 to every possible bigram

  • Adds V to the denominator to re-normalize

Visualization:

Imagine a histogram of bigram counts. Add-one smoothing raises the floor for every possible next word:

Original counts:
  like pizza: 2
  like tacos: 0

After smoothing:
  like pizza: 3
  like tacos: 1

Now “like tacos” has a probability > 0.

But... it's not perfect:

  • Penalizes frequent events too much

  • Can distort distributions, especially with large vocabularies
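
A minimal sketch of add-one smoothing on the same toy corpus (function and variable names are just for illustration):

```python
from collections import Counter

corpus = "<s> I like pizza </s> <s> I eat pizza </s> <s> I want tacos </s>".split()
bigrams, unigrams = Counter(zip(corpus, corpus[1:])), Counter(corpus)
V = len(set(corpus))  # vocabulary size: 8 word types in this toy corpus

def laplace_bigram(w_prev, w):
    # Add-one estimate: (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(laplace_bigram("like", "tacos"))  # 1/9 ~ 0.11: no longer zero
print(laplace_bigram("like", "pizza"))  # 2/9 ~ 0.22: discounted from the MLE value of 1.0
```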

3. Add-k Smoothing

A generalization of add-one is add-k smoothing, where we add a small fraction k (instead of 1) to each count.

Formula:

P_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}

  • Common choices: k = 0.01, k = 0.1

  • Keeps rare events plausible without overwhelming frequent ones

Visualization:

  • Like add-one, but with a gentler lift to unseen events

  • Prevents over-discounting common n-grams
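
The change from add-one is a single parameter. Reusing the toy counts (`bigrams`, `unigrams`, `V`) from the add-one sketch above:

```python
def add_k_bigram(w_prev, w, k):
    # Add-k estimate: (C(w_prev w) + k) / (C(w_prev) + k * V)
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

print(add_k_bigram("like", "tacos", k=1.0))  # 0.111: add-one is just the k = 1 special case
print(add_k_bigram("like", "tacos", k=0.1))  # 0.056: a gentler lift for the unseen bigram
print(add_k_bigram("like", "pizza", k=0.1))  # 0.611: the seen bigram is discounted far less
```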

4. Interpolation: Combining N-Gram Orders

Instead of relying only on high-order n-grams (which are more sparse), interpolation combines estimates from multiple n-gram levels.

Formula (trigram example):

P_{\text{interp}}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n)

Where:

  • \lambda_i are weights that sum to 1

  • Each term comes from a different n-gram level (tri-, bi-, unigram)

Intuition:

  • If a trigram is unseen, fall back partially to bigram and unigram

  • No hard cutoff—just a weighted blend

Visual Analogy:

Think of a 3-layer fallback system:

If the trigram exists → trust it more (high λ₃)
If not → trust the bigram more (medium λ₂)
Still no clue? → rely on the unigram (low λ₁)

Advantages:

  • Smooth transition between specificity and generality

  • More robust than add-one for real-world LMs
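
A sketch of the blend, assuming we already have the component estimates as dictionaries (all values here are made up for illustration; in practice the λ weights are tuned on a held-out dev set):

```python
def interp_trigram(w, prev2, prev1, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # Weighted blend of trigram, bigram and unigram estimates; the lambdas sum to 1
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((prev2, prev1, w), 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

# Hypothetical component estimates for P("tacos" | "I like")
p_tri = {}                       # trigram "I like tacos" never seen
p_bi  = {}                       # bigram "like tacos" never seen either
p_uni = {"tacos": 0.05}          # but "tacos" does occur as a unigram

print(interp_trigram("tacos", "I", "like", p_tri, p_bi, p_uni))  # 0.1 * 0.05 = 0.005
```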

5. Stupid Backoff: A Fast and Practical Heuristic

Proposed by Brants et al. (2007), stupid backoff is an efficient alternative for large-scale language models.

Instead of computing weighted averages, it just backs off to lower-order n-grams with a fixed discount.

Formula:

S(w_i \mid h) = \begin{cases} \frac{C(h\, w_i)}{C(h)} & \text{if } C(h\, w_i) > 0 \\ \alpha \cdot S(w_i \mid h') & \text{otherwise} \end{cases}

Where:

  • h is the current context (e.g., trigram history)

  • h' is the shorter context (e.g., bigram history)

  • α is a discount factor (usually 0.4)

Behavior:

  • No normalization needed

  • Not a true probability distribution

  • Very fast and scales to web-sized corpora

Visual:

Imagine falling down a ladder:

4-gram missing → try 3-gram → try bigram → try unigram
Each fallback = smaller confidence
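
Here is a compact recursive sketch of that ladder (the count dictionary is a made-up toy, and the score is deliberately not normalized):

```python
def stupid_backoff(w, history, counts, alpha=0.4):
    # S(w | h): relative frequency if the n-gram was seen, else alpha * S(w | shorter history)
    if history:
        h = tuple(history)
        if counts.get(h + (w,), 0) > 0:
            return counts[h + (w,)] / counts[h]
        return alpha * stupid_backoff(w, history[1:], counts, alpha)
    total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    return counts.get((w,), 0) / total       # base case: unigram relative frequency

# Hypothetical unigram/bigram/trigram counts in one dictionary
counts = {
    ("like",): 3, ("tacos",): 2, ("pizza",): 5,
    ("I", "like"): 3, ("like", "pizza"): 2, ("I", "like", "pizza"): 2,
}

print(stupid_backoff("pizza", ["I", "like"], counts))  # 2/3: trigram seen, no backoff
print(stupid_backoff("tacos", ["like"], counts))       # 0.4 * (2/10) = 0.08: backed off to the unigram
```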

6. Smoothing in Action: Example Table

Let’s say we want to compute P(\text{"tacos"} \mid \text{"like"}):

| Method | Raw Count ("like tacos") | Probability |
| --- | --- | --- |
| MLE (no smoothing) | 0 | 0 |
| Add-one | 0 | ~0.0003 |
| Add-k (k = 0.1) | 0 | ~0.00004 |
| Interpolation | 0 (but unigram "tacos" > 0) | ~0.0002 |
| Stupid backoff | 0 | 0.4 × P(tacos) |

Even if “like tacos” was never seen in training, we can still assign a small, meaningful probability.

7. Summary: Smoothing Techniques Compared

| Technique | Strengths | Weaknesses |
| --- | --- | --- |
| Add-One (Laplace) | Simple to implement | Over-penalizes frequent words |
| Add-k | More balanced than add-one | Still heuristic; may need tuning |
| Interpolation | Blends context levels smartly | Requires a dev set to tune the λs |
| Stupid Backoff | Extremely fast and scalable | Not a true probability; no normalization |

Conclusion

Smoothing isn’t just a technical detail—it’s what makes language modeling practical. Without it, your model can’t handle new combinations, rare phrases, or creative variations.

Whether you’re building an academic prototype or a production-scale LM, choosing the right smoothing method makes the difference between a brittle parrot and a flexible, adaptive model.


Generalizing vs Overfitting in Language Models


When it comes to training language models, there’s a fine line between learning and memorizing.

  • A model that generalizes well can handle new sentences, new domains, and new speakers.

  • A model that overfits may perform brilliantly on training data—but falls apart in the real world.

In this post, we’ll explore the critical tension between generalization and overfitting in n-gram models, show how it plays out using samples from Shakespeare vs the Wall Street Journal, and discuss how issues like dialect, register, and genre impact performance.

1. What Is Overfitting in Language Models?

Overfitting happens when a model becomes so tailored to the training data that it loses the ability to adapt to new, unseen text.

In n-gram models, overfitting appears when:

  • The model has memorized specific n-gram sequences

  • It fails to predict reasonable alternatives not seen in training

  • It assigns zero probability to unseen—but valid—phrases

This is especially common in higher-order n-gram models, like 4-grams or 5-grams.

2. A Simple Example: 4-Gram Memorization

Suppose we train a 4-gram model on the sentence:

“The quick brown fox jumps”

When asked to predict the next word after “The quick brown fox”, the model confidently says “jumps”. But it may assign zero probability to:

  • “runs”

  • “sleeps”

  • “walks”

Why? Because it has never seen those variations. It memorized the one phrase it saw, and now assumes that’s the only possible continuation.

This is overfitting.

3. Sampling Comparison: Shakespeare vs WSJ

Let’s train two 4-gram models:

  • One on Shakespearean plays

  • One on Wall Street Journal (WSJ) articles

Now let’s generate some sample sentences.

Shakespeare 4-gram:

“Thou art not fit to live. I must be cruel only to be kind.”
“The lady doth protest too much, methinks.”

These sentences are:

  • Perfectly fluent

  • Often word-for-word matches from Shakespeare

  • Rich in emotion and style

But are they truly learned?

Not really—they’re memorized from a small, fixed corpus (~884,000 words total).

WSJ 4-gram:

“The Federal Reserve Board said interest rates will remain steady.”
“The stock market closed higher amid investor optimism.”

Again:

  • Fluent.

  • Factual-sounding.

  • Pulled from repeated financial reporting structures.

But if we tried to use the Shakespeare model to generate business news, or the WSJ model to write drama—it wouldn’t work.

Each model overfits to its genre.

4. Why Overfitting Happens More in Higher-Order N-Grams

The number of possible n-grams explodes as the n-gram order grows. For a vocabulary of size V:

\text{Possible n-grams} = V^n

If V = 10,000, then:

  • Bigrams = 100 million

  • Trigrams = 1 trillion

  • 4-grams = 10 quadrillion
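
A two-line sanity check of those numbers:

```python
V = 10_000
for n in (2, 3, 4):
    print(n, f"{V ** n:,}")  # 100,000,000 / 1,000,000,000,000 / 10,000,000,000,000,000
```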

Most of these n-grams never occur in even massive corpora. So when training a 4-gram model:

  • Most predictions are based on rare or single-occurrence phrases

  • The model becomes data-hungry, relying on memorized chunks

Unless the training data is extremely large and diverse, high-order n-gram models are likely to overfit.

5. The Cost of Overfitting: Poor Generalization

Let’s say you train a chatbot on formal customer emails. Now you test it on:

  • Casual tweets

  • Spoken text messages

  • Technical documentation

Chances are, it will:

  • Misinterpret slang

  • Struggle with unseen syntactic patterns

  • Produce awkward or irrelevant responses

That’s poor generalization—your model hasn’t learned the underlying structure of language, only the surface patterns of one type of text.

6. Mismatches in Dialect, Register, and Genre

Overfitting also occurs when there’s a mismatch between training and application domains.

| Aspect | Example |
| --- | --- |
| Dialect | African American English vs Standard English |
| Register | Casual speech vs formal writing |
| Genre | News articles vs fictional dialogue |

Example: Dialect mismatch

You train on:

“I am going to the store.”

Then encounter:

“I’m finna hit the store.”

An overfit model would fail here. It doesn’t know “finna” means “about to”.

Example: Genre mismatch

A model trained on WSJ headlines might stumble when analyzing Reddit posts, YouTube comments, or TikTok transcripts.

The vocabulary, structure, and social context all shift.

7. Strategies to Encourage Generalization

Here’s how to make your n-gram models less brittle:

Use lower-order n-grams (bigrams/trigrams)
→ More general, less prone to memorization.

Apply smoothing techniques (Laplace, interpolation, backoff)
→ Assign small probabilities to unseen n-grams.

Prune rare n-grams
→ Remove overly specific sequences that aren’t helpful.

Train on diverse corpora
→ Include multiple genres, dialects, and registers.

Switch to subword or neural models
→ More robust to rare words and unseen forms.

8. Key Differences: Generalization vs Overfitting

| | Generalization | Overfitting |
| --- | --- | --- |
| Learns patterns | ✅ Yes | ❌ No; memorizes exact phrases |
| Handles new data | ✅ Performs reasonably | ❌ Fails on unseen domains |
| Uses context flexibly | ✅ Predicts alternatives | ❌ Predicts only what was seen |
| Model size needed | Moderate (lower n) | Large (higher n, more memory needed) |
| Useful for | Chatbots, summarization, translation | Text replication, mimicry (e.g., fan fiction) |

Conclusion

Overfitting is like cramming for a test—you might ace the exam, but you haven’t really learned the material. Similarly, an overfit language model may seem brilliant in the lab, but stumbles in the wild.

The goal is balance: train a model that learns the essence of language—not just the echoes of your training set.

By sampling output, comparing across domains, and designing your model for generalization, you’ll build a system that not only speaks—but understands.
