Saturday, 15 November 2025

Entropy, Cross-Entropy, and Perplexity: The Deep Math Behind Language Models

 Most people working with language models have heard of perplexity—the go-to metric for evaluating how well a model predicts text.

But underneath perplexity lies something deeper: the mathematical foundations of information theory. Concepts like entropy and cross-entropy explain not just how perplexity is computed, but why it works—and what it really measures.

In this blog post, we’ll:

  • Explain entropy using simple examples (like betting on horse races),

  • Show how cross-entropy approximates it when we don’t know the real distribution, and

  • Connect it all to perplexity, the exponential of the negative average log probability.

If you’ve ever wondered why perplexity is defined the way it is, this post is for you.

1. What Is Entropy? The Core Idea of Information

In information theory (Shannon, 1948), entropy measures uncertainty or surprise in a probability distribution.

The higher the entropy, the more uncertain or random the outcome.
The lower the entropy, the more predictable it is.

⚖️ Analogy: Horse Race Guess

Let’s say 8 horses are running in a race. You want to bet on the winner.

Case A: All horses are equally likely

| Horse | Probability |
| --- | --- |
| 1 | 1/8 |
| 2 | 1/8 |
| ... | ... |
| 8 | 1/8 |

  • You have no idea who will win.

  • The outcome is maximally uncertain.

  • Entropy = 3 bits

H(X) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3

You’d need 3 bits to encode the winner.

Case B: One horse is favored

| Horse | Probability |
| --- | --- |
| 1 | 0.5 |
| 2 | 0.25 |
| 3–8 | 0.0417 each |

  • The outcome is more predictable.

  • Entropy drops below 3 bits—less uncertainty.

Entropy tells us the minimum average number of bits needed to encode outcomes in this distribution.
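
To make this concrete, here is a minimal Python sketch (the `entropy` helper is our own illustrative name, not from any particular library) that reproduces both cases:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case A: 8 equally likely horses -> maximal uncertainty
print(entropy([1/8] * 8))                     # 3.0 bits

# Case B: one strong favourite -> less uncertainty
print(entropy([0.5, 0.25] + [0.25 / 6] * 6))  # ~2.15 bits, below 3
```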

2. Entropy in Language

In language, entropy measures the average unpredictability of the next word.

  • If your model is confident what comes next (e.g., "I want to ___" → "eat"), entropy is low.

  • If many words are equally likely, entropy is high.

Formally:

H(P) = - \sum_{w} P(w) \log_2 P(w)

Where:

  • P(w) is the probability of word w

  • The sum is over all possible words
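
As a tiny illustration (toy numbers, not from any real corpus), compare a confident next-word distribution with a diffuse one:

```python
import math

def entropy(probs):
    # Shannon entropy in bits of a next-word distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = {"eat": 0.9, "go": 0.05, "sleep": 0.05}                # "I want to ___"
diffuse   = {"eat": 0.25, "go": 0.25, "sleep": 0.25, "sing": 0.25}

print(entropy(confident.values()))  # ~0.57 bits: low, predictable
print(entropy(diffuse.values()))    # 2.0 bits: high, uncertain
```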

3. Cross-Entropy: When You Don’t Know the Truth

But wait: how do we calculate entropy if we don’t know the true distribution P of language?

That’s where cross-entropy comes in.

Suppose:

  • P = the true distribution (ideal, unknown)

  • Q = your model's predicted distribution

Then:

H(P, Q) = - \sum_{w} P(w) \log_2 Q(w)

Cross-entropy answers:

“On average, how many bits does my model Q use to encode samples from the true distribution P?”

If Q = P, cross-entropy = entropy (perfect model).
If Q is wrong, cross-entropy > entropy.
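
A quick sketch of that relationship (the distributions here are made up purely for illustration):

```python
import math

def cross_entropy(p, q):
    # H(P, Q) = -sum_w P(w) * log2 Q(w), in bits per word
    return -sum(p[w] * math.log2(q[w]) for w in p if p[w] > 0)

P = {"eat": 0.7, "go": 0.2, "sleep": 0.1}   # hypothetical "true" distribution
Q = {"eat": 0.5, "go": 0.3, "sleep": 0.2}   # a model's imperfect estimate

print(cross_entropy(P, P))  # ~1.16 bits: H(P, P) is just the entropy H(P)
print(cross_entropy(P, Q))  # ~1.28 bits: always >= H(P) when Q differs from P
```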

4. Connecting to Perplexity

Now let’s bring in perplexity.

Perplexity is simply:

\text{Perplexity}(W) = 2^{H(P, Q)}

Or for a word sequence of length N:

\text{Perplexity}(W) = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid \text{context}) }

Perplexity is the exponential of the negative average log-probability of the test sequence. It measures:

  • The effective number of equally likely next words at each step (the branching factor).

  • How “confused” your model is when making predictions.

Lower cross-entropy → lower perplexity → better model.
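
In code, the sequence form is just an exponentiated average. Here is a minimal sketch, assuming we already have the probability the model assigned to each test word given its context:

```python
import math

def perplexity(token_probs):
    # 2 ** ( -(1/N) * sum_i log2 Q(w_i | context) )
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)

# Hypothetical per-word probabilities for a 6-word test sentence
probs = [0.2, 0.1, 0.25, 0.05, 0.3, 0.15]
print(perplexity(probs))  # ~6.7: on average, about 7 "equally likely" next words per step
```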

5. Visualization: Entropy, Cross-Entropy, and Perplexity

Let’s compare three scenarios using a test sentence:

“The cat sat on the mat”

| Model | Cross-Entropy (bits/word) | Perplexity |
| --- | --- | --- |
| Perfect model (Q = P) | 2 | 2^2 = 4 |
| Good model | 3 | 8 |
| Weak model | 5 | 32 |

  • The perfect model is less “perplexed”—knows the next word confidently.

  • The weak model is unsure—considers more possible words per step.

6. Why Does This Matter for Language Modeling?

Because we can’t directly access the true entropy of language (we don’t know P), we use:

✅ A test set (as a sample from P)
✅ Our model Q to compute probabilities

Then, we compute cross-entropy and perplexity to measure how well Q fits P.

If your perplexity is high, your model isn’t capturing the structure of the language well.

7. Recap: How It All Connects

| Concept | What It Measures | Unit | Relationship |
| --- | --- | --- | --- |
| Entropy H(P) | True uncertainty of a distribution | bits | Lower = more predictable |
| Cross-Entropy H(P, Q) | Average encoding cost with model Q | bits | ≥ Entropy |
| Perplexity | Exponential of cross-entropy | "Branching factor" | 2^{H(P,Q)} |

8. Bonus: Cross-Entropy Loss in Deep Learning

In deep learning (e.g., training BERT or GPT), we often minimize:

\text{Cross-Entropy Loss} = -\sum_{w} P(w) \log Q(w)

This is exactly the same idea—just applied to one-hot target vectors and predicted softmax outputs. So:

If you’re minimizing cross-entropy loss during training, you’re really minimizing expected code length and perplexity.
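
Here is that loss spelled out with NumPy for a single prediction over a toy 4-word vocabulary (frameworks such as PyTorch compute the same quantity, typically in natural-log units, i.e. nats):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 0.5, 3.0, -1.0])  # raw model scores over a toy vocabulary
target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot "true" distribution P(w)

q = softmax(logits)                        # predicted distribution Q(w)
loss = -np.sum(target * np.log(q))         # -sum_w P(w) log Q(w)
print(loss)                                # lower loss = shorter expected code length
```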

Conclusion

Behind every perplexity score lies a story of information—the bits, the guesses, the uncertainty.

  • Entropy tells us how unpredictable language is.

  • Cross-entropy tells us how well our model predicts it.

  • Perplexity translates all that into a simple score we can optimize.

So the next time you measure a model’s perplexity, remember: you're not just crunching numbers—you’re measuring how well your model understands the fundamental structure of human language.


The Magic of Smoothing: Solving the Zero-Probability Problem

 Let’s say you’ve built a language model, trained it on thousands of sentences, and you're ready to test it. But then… disaster:

Your model assigns a probability of zero to a perfectly valid sentence.

Why?

Because your model has never seen that exact word combination before. In the world of n-gram models, this is known as the zero-probability problem, and it can cripple your model’s ability to generalize.

The solution? Smoothing.

In this post, we’ll explore why smoothing is essential, and break down four classic smoothing techniques:
✔️ Add-one (Laplace)
✔️ Add-k
✔️ Interpolation
✔️ Stupid backoff

We'll also walk through formulas and give you intuitive ways to visualize what each method is doing.

1. The Zero-Probability Problem

Imagine you train a bigram model on the following corpus:

<s> I like pizza </s>
<s> I eat pizza </s>
<s> I want tacos </s>

Then someone types:

“I like tacos”

Uh-oh.

The model has:

  • Seen “I like”

  • Seen “like pizza”

  • Seen “want tacos”

But never seen “like tacos”. So it assigns:

P(\text{"tacos"} \mid \text{"like"}) = 0

Which means:

  • The sentence "I like tacos" gets zero total probability

  • Perplexity becomes undefined (division by zero)

  • The model fails to recognize a totally valid combination

This is where smoothing comes in—to assign a small but nonzero probability to unseen events.
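
You can see the failure in a few lines of Python. This is a toy maximum-likelihood bigram model over the corpus above (the helper names are ours, just for illustration):

```python
from collections import Counter

corpus = "<s> I like pizza </s> <s> I eat pizza </s> <s> I want tacos </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def mle_bigram(w_prev, w):
    # Maximum-likelihood estimate: C(w_prev w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(mle_bigram("like", "pizza"))  # 1.0  -> seen in training
print(mle_bigram("like", "tacos"))  # 0.0  -> zeroes out the whole sentence
```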

2. Add-One Smoothing (Laplace)

The most basic smoothing technique is add-one smoothing, also known as Laplace smoothing.

Formula:

P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}

Where:

  • C(w_{n-1} w_n): count of the bigram

  • C(w_{n-1}): count of the context word

  • V: vocabulary size

This method:

  • Adds 1 to every possible bigram

  • Adds V to the denominator to re-normalize

Visualization:

Imagine a histogram of bigram counts. Add-one smoothing raises the floor for every possible next word:

Original counts:
  like pizza: 2
  like tacos: 0

After smoothing:
  like pizza: 3
  like tacos: 1

Now “like tacos” has a probability > 0.

But... it's not perfect:

  • Penalizes frequent events too much

  • Can distort distributions, especially with large vocabularies
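
A minimal sketch of add-one smoothing on the same toy corpus (function and variable names are just for illustration):

```python
from collections import Counter

corpus = "<s> I like pizza </s> <s> I eat pizza </s> <s> I want tacos </s>".split()
bigrams, unigrams = Counter(zip(corpus, corpus[1:])), Counter(corpus)
V = len(set(corpus))  # vocabulary size: 8 word types in this toy corpus

def laplace_bigram(w_prev, w):
    # Add-one estimate: (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(laplace_bigram("like", "tacos"))  # 1/9 ~ 0.11: no longer zero
print(laplace_bigram("like", "pizza"))  # 2/9 ~ 0.22: discounted from the MLE value of 1.0
```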

3. Add-k Smoothing

A generalization of add-one is add-k smoothing, where we add a small fraction k (instead of 1) to each count.

Formula:

P_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}

  • Common choices: k = 0.01, k = 0.1

  • Keeps rare events plausible without overwhelming frequent ones

Visualization:

  • Like add-one, but with a gentler lift to unseen events

  • Prevents over-discounting common n-grams
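
The change from add-one is a single parameter. Reusing the toy counts (`bigrams`, `unigrams`, `V`) from the add-one sketch above:

```python
def add_k_bigram(w_prev, w, k):
    # Add-k estimate: (C(w_prev w) + k) / (C(w_prev) + k * V)
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

print(add_k_bigram("like", "tacos", k=1.0))  # 0.111: add-one is just the k = 1 special case
print(add_k_bigram("like", "tacos", k=0.1))  # 0.056: a gentler lift for the unseen bigram
print(add_k_bigram("like", "pizza", k=0.1))  # 0.611: the seen bigram is discounted far less
```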

4. Interpolation: Combining N-Gram Orders

Instead of relying only on high-order n-grams (which are more sparse), interpolation combines estimates from multiple n-gram levels.

Formula (trigram example):

P_{\text{interp}}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n)

Where:

  • \lambda_i are weights that sum to 1

  • Each term comes from a different n-gram level (tri-, bi-, unigram)

Intuition:

  • If a trigram is unseen, fall back partially to bigram and unigram

  • No hard cutoff—just a weighted blend

Visual Analogy:

Think of a 3-layer fallback system:

If the trigram exists → trust it more (high λ₃)
If not → trust the bigram more (medium λ₂)
Still no clue? → rely on the unigram (low λ₁)

Advantages:

  • Smooth transition between specificity and generality

  • More robust than add-one for real-world LMs
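
A sketch of the blend, assuming we already have the component estimates as dictionaries (all values here are made up for illustration; in practice the λ weights are tuned on a held-out dev set):

```python
def interp_trigram(w, prev2, prev1, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # Weighted blend of trigram, bigram and unigram estimates; the lambdas sum to 1
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((prev2, prev1, w), 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

# Hypothetical component estimates for P("tacos" | "I like")
p_tri = {}                       # trigram "I like tacos" never seen
p_bi  = {}                       # bigram "like tacos" never seen either
p_uni = {"tacos": 0.05}          # but "tacos" does occur as a unigram

print(interp_trigram("tacos", "I", "like", p_tri, p_bi, p_uni))  # 0.1 * 0.05 = 0.005
```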

5. Stupid Backoff: A Fast and Practical Heuristic

Proposed by Brants et al. (2007), stupid backoff is an efficient alternative for large-scale language models.

Instead of computing weighted averages, it just backs off to lower-order n-grams with a fixed discount.

Formula:

S(w_i \mid h) = \begin{cases} \frac{C(h\, w_i)}{C(h)} & \text{if } C(h\, w_i) > 0 \\ \alpha \cdot S(w_i \mid h') & \text{otherwise} \end{cases}

Where:

  • h is the current context (e.g., trigram history)

  • h' is the shorter context (e.g., bigram history)

  • α is a discount factor (usually 0.4)

Behavior:

  • No normalization needed

  • Not a true probability distribution

  • Very fast and scales to web-sized corpora

Visual:

Imagine falling down a ladder:

4-gram missing → try 3-gram → try bigram → try unigram
Each fallback = smaller confidence
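
Here is a compact recursive sketch of that ladder (the count dictionary is a made-up toy, and the score is deliberately not normalized):

```python
def stupid_backoff(w, history, counts, alpha=0.4):
    # S(w | h): relative frequency if the n-gram was seen, else alpha * S(w | shorter history)
    if history:
        h = tuple(history)
        if counts.get(h + (w,), 0) > 0:
            return counts[h + (w,)] / counts[h]
        return alpha * stupid_backoff(w, history[1:], counts, alpha)
    total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    return counts.get((w,), 0) / total       # base case: unigram relative frequency

# Hypothetical unigram/bigram/trigram counts in one dictionary
counts = {
    ("like",): 3, ("tacos",): 2, ("pizza",): 5,
    ("I", "like"): 3, ("like", "pizza"): 2, ("I", "like", "pizza"): 2,
}

print(stupid_backoff("pizza", ["I", "like"], counts))  # 2/3: trigram seen, no backoff
print(stupid_backoff("tacos", ["like"], counts))       # 0.4 * (2/10) = 0.08: backed off to the unigram
```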

6. Smoothing in Action: Example Table

Let’s say we want to compute P(\text{"tacos"} \mid \text{"like"}):

| Method | Raw Count ("like tacos") | Probability |
| --- | --- | --- |
| MLE (no smoothing) | 0 | 0 |
| Add-one | 0 | ~0.0003 |
| Add-k (k = 0.1) | 0 | ~0.00004 |
| Interpolation | 0 (but unigram "tacos" > 0) | ~0.0002 |
| Stupid backoff | 0 | 0.4 × P(tacos) |

Even if “like tacos” was never seen in training, we can still assign a small, meaningful probability.

7. Summary: Smoothing Techniques Compared

| Technique | Strengths | Weaknesses |
| --- | --- | --- |
| Add-One (Laplace) | Simple to implement | Over-penalizes frequent words |
| Add-k | More balanced than add-one | Still heuristic; may need tuning |
| Interpolation | Blends context levels smartly | Requires a dev set to tune the λs |
| Stupid Backoff | Extremely fast and scalable | Not a true probability; no normalization |

Conclusion

Smoothing isn’t just a technical detail—it’s what makes language modeling practical. Without it, your model can’t handle new combinations, rare phrases, or creative variations.

Whether you’re building an academic prototype or a production-scale LM, choosing the right smoothing method makes the difference between a brittle parrot and a flexible, adaptive model.


Generalizing vs Overfitting in Language Models


When it comes to training language models, there’s a fine line between learning and memorizing.

  • A model that generalizes well can handle new sentences, new domains, and new speakers.

  • A model that overfits may perform brilliantly on training data—but falls apart in the real world.

In this post, we’ll explore the critical tension between generalization and overfitting in n-gram models, show how it plays out using samples from Shakespeare vs the Wall Street Journal, and discuss how issues like dialect, register, and genre impact performance.

1. What Is Overfitting in Language Models?

Overfitting happens when a model becomes so tailored to the training data that it loses the ability to adapt to new, unseen text.

In n-gram models, overfitting appears when:

  • The model has memorized specific n-gram sequences

  • It fails to predict reasonable alternatives not seen in training

  • It assigns zero probability to unseen—but valid—phrases

This is especially common in higher-order n-gram models, like 4-grams or 5-grams.

2. A Simple Example: 4-Gram Memorization

Suppose we train a 4-gram model on the sentence:

“The quick brown fox jumps”

When asked to predict the next word after “The quick brown fox”, the model confidently says “jumps”. But it may assign zero probability to:

  • “runs”

  • “sleeps”

  • “walks”

Why? Because it has never seen those variations. It memorized the one phrase it saw, and now assumes that’s the only possible continuation.

This is overfitting.

3. Sampling Comparison: Shakespeare vs WSJ

Let’s train two 4-gram models:

  • One on Shakespearean plays

  • One on Wall Street Journal (WSJ) articles

Now let’s generate some sample sentences.

Shakespeare 4-gram:

“Thou art not fit to live. I must be cruel only to be kind.”
“The lady doth protest too much, methinks.”

These sentences are:

  • Perfectly fluent

  • Often word-for-word matches from Shakespeare

  • Rich in emotion and style

But are they truly learned?

Not really—they’re memorized from a small, fixed corpus (~884,000 words total).

WSJ 4-gram:

“The Federal Reserve Board said interest rates will remain steady.”
“The stock market closed higher amid investor optimism.”

Again:

  • Fluent.

  • Factual-sounding.

  • Pulled from repeated financial reporting structures.

But if we tried to use the Shakespeare model to generate business news, or the WSJ model to write drama—it wouldn’t work.

Each model overfits to its genre.

4. Why Overfitting Happens More in Higher-Order N-Grams

The number of possible n-grams explodes as the n-gram order grows. For a vocabulary of size V:

\text{Possible n-grams} = V^n

If V = 10,000, then:

  • Bigrams = 100 million

  • Trigrams = 1 trillion

  • 4-grams = 10 quadrillion
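
A two-line sanity check of those numbers:

```python
V = 10_000
for n in (2, 3, 4):
    print(n, f"{V ** n:,}")  # 100,000,000 / 1,000,000,000,000 / 10,000,000,000,000,000
```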

Most of these n-grams never occur in even massive corpora. So when training a 4-gram model:

  • Most predictions are based on rare or single-occurrence phrases

  • The model becomes data-hungry, relying on memorized chunks

Unless the training data is extremely large and diverse, high-order n-gram models are likely to overfit.

5. The Cost of Overfitting: Poor Generalization

Let’s say you train a chatbot on formal customer emails. Now you test it on:

  • Casual tweets

  • Spoken text messages

  • Technical documentation

Chances are, it will:

  • Misinterpret slang

  • Struggle with unseen syntactic patterns

  • Produce awkward or irrelevant responses

That’s poor generalization—your model hasn’t learned the underlying structure of language, only the surface patterns of one type of text.

6. Mismatches in Dialect, Register, and Genre

Overfitting also occurs when there’s a mismatch between training and application domains.

| Aspect | Example |
| --- | --- |
| Dialect | African American English vs Standard English |
| Register | Casual speech vs formal writing |
| Genre | News articles vs fictional dialogue |

Example: Dialect mismatch

You train on:

“I am going to the store.”

Then encounter:

“I’m finna hit the store.”

An overfit model would fail here. It doesn’t know “finna” means “about to”.

Example: Genre mismatch

A model trained on WSJ headlines might stumble when analyzing Reddit posts, YouTube comments, or TikTok transcripts.

The vocabulary, structure, and social context all shift.

7. Strategies to Encourage Generalization

Here’s how to make your n-gram models less brittle:

Use lower-order n-grams (bigrams/trigrams)
→ More general, less prone to memorization.

Apply smoothing techniques (Laplace, interpolation, backoff)
→ Assign small probabilities to unseen n-grams.

Prune rare n-grams
→ Remove overly specific sequences that aren’t helpful.

Train on diverse corpora
→ Include multiple genres, dialects, and registers.

Switch to subword or neural models
→ More robust to rare words and unseen forms.

8. Key Differences: Generalization vs Overfitting

| | Generalization | Overfitting |
| --- | --- | --- |
| Learns patterns | ✅ Yes | ❌ No; memorizes exact phrases |
| Handles new data | ✅ Performs reasonably | ❌ Fails on unseen domains |
| Uses context flexibly | ✅ Predicts alternatives | ❌ Predicts only what was seen |
| Model size needed | Moderate (lower n) | Large (higher n, more memory needed) |
| Useful for | Chatbots, summarization, translation | Text replication, mimicry (e.g., fan fiction) |

Conclusion

Overfitting is like cramming for a test—you might ace the exam, but you haven’t really learned the material. Similarly, an overfit language model may seem brilliant in the lab, but stumbles in the wild.

The goal is balance: train a model that learns the essence of language—not just the echoes of your training set.

By sampling output, comparing across domains, and designing your model for generalization, you’ll build a system that not only speaks—but understands.
