Language models aren't just for computing probabilities or scoring test sets—they can also generate language. One of the most intuitive and powerful ways to explore a language model is to sample sentences from it.
In this post, we’ll walk through how sentence sampling works, compare samples from unigrams to 4-grams, and show how generation helps reveal a model’s strengths and limitations—with a few Shakespearean twists along the way.
1. What Does It Mean to "Sample" from a Language Model?
Sampling means:
Randomly generating words one at a time, where the choice of each word is guided by the language model’s probabilities.
Think of it like rolling a weighted die at each step: the more likely the word, the more faces of the die it takes up.
- In unigram models, each word is chosen independently.
- In bigram models, the next word depends on the last one.
- In trigram models, it depends on the last two.
- And so on.
Sampling lets us see what kind of text the model has learned to produce.
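To make the "weighted die" concrete, here is a minimal sketch in Python; the vocabulary and probabilities are invented for illustration, not estimated from any corpus:

```python
import random

# Toy unigram probabilities (invented for illustration).
vocab = ["the", "king", "of", "night"]
probs = [0.5, 0.2, 0.2, 0.1]

# One roll of the weighted die: "the" comes up about half the time.
word = random.choices(vocab, weights=probs, k=1)[0]
print(word)
```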
2. From Unigram to 4-Gram: Sampling Examples
Let’s say we train a series of n-gram models on Shakespeare’s works.
Now let’s generate sentences using each model.
✅ Unigram Model (n = 1)
“to king am enter death great and bear flourish to the of night”
- No grammar.
- No coherence.
- A random salad of words.
- Why? The model picks each word independently, with no context.
✅ Bigram Model (n = 2)
“I am a king. The night is long. Enter the stage.”
- Still shaky, but better.
- We get local word-to-word coherence (e.g., “I am”, “Enter the”).
- But the sentence as a whole doesn’t make much sense.
✅ Trigram Model (n = 3)
“I am bound upon a wheel of fire. My tears do scald like molten lead.”
- This is getting Shakespearean!
- Longer-range structure appears.
- Sentences start to look grammatical and thematic.
- The model “remembers” the two previous words, enabling phrase continuity.
✅ 4-Gram Model (n = 4)
“King Lear: I did her wrong. Cordelia: You have some cause, they have not.”
- Almost indistinguishable from real text.
- Phrases are lifted directly from the corpus.
- But there's a catch...
⚠️ The model is memorizing, not understanding.
3. How to Sample in Practice
Here’s a simple procedure to sample from an n-gram model:
- Start with special start-of-sentence tokens: <s> for a bigram model, <s> <s> for a trigram model, and so on.
- Choose the next word based on the model's probabilities given the current context.
- Repeat until you hit an end-of-sentence token (e.g., </s>) or reach a maximum length.
In Python, a minimal runnable sketch looks like this (the toy bigram_model dictionary is invented for illustration, not estimated from Shakespeare):
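```python
import random

# A toy bigram model: each context word maps to a distribution over next words.
bigram_model = {
    "<s>":   {"i": 0.4, "the": 0.35, "enter": 0.25},
    "i":     {"am": 0.7, "did": 0.3},
    "am":    {"a": 0.6, "bound": 0.4},
    "a":     {"king": 1.0},
    "the":   {"night": 0.5, "king": 0.5},
    "king":  {"</s>": 1.0},
    "bound": {"</s>": 1.0},
    "did":   {"</s>": 1.0},
    "night": {"is": 1.0},
    "is":    {"long": 1.0},
    "long":  {"</s>": 1.0},
    "enter": {"the": 1.0},
}

def sample_sentence(model, max_len=20):
    context = "<s>"           # start-of-sentence token
    words = []
    for _ in range(max_len):  # stop at max length if no </s> is drawn
        dist = model[context]
        # Roll the weighted die over the next-word distribution.
        next_word = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        context = next_word
    return " ".join(words)

print(sample_sentence(bigram_model))  # e.g., "i am a king"
```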
For each word, we “roll the dice” based on what the model thinks comes next.
4. Why Sampling is So Revealing
Sampling reveals aspects that evaluation metrics (like perplexity) can’t:
| Sampling Shows... | Example |
|---|---|
| Coherence | Does the sentence make grammatical sense? |
| Fluency | Are the transitions between words smooth or clunky? |
| Repetitiveness | Does the model fall into loops (“the king the king the”)? |
| Memorization vs Generalization | Is the model just copying phrases or creating new ones? |
| Vocabulary Use | Are certain words over- or under-represented? |
Even without a formal test set, a few sampled sentences can quickly expose issues like:
- A poorly tuned smoothing method
- A truncated vocabulary
- Overfitting (perfectly replicated training phrases)
5. Visualizing Sampling: A Cumulative Probability Line
When sampling a word, we often imagine a number line from 0 to 1, where each word occupies a segment proportional to its probability.
Let’s say our unigram model assigns:
- "the": 0.08
- "king": 0.05
- "of": 0.04
- ...
We pick a random number (say 0.09) → it falls into the "king" segment → we output "king".
This technique is called multinomial sampling.
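Here is a minimal sketch of that cumulative number-line step in Python, using the three example probabilities above. Since the "..." tail of the distribution is omitted, the random number is drawn over just these three words:

```python
import random

# Unigram probabilities from the example above (the "..." tail is omitted).
probs = {"the": 0.08, "king": 0.05, "of": 0.04}

def multinomial_sample(probs):
    """Pick a word by walking the cumulative-probability number line."""
    r = random.uniform(0, sum(probs.values()))
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return word
    return word  # guard against floating-point rounding at the very end

# Segments: "the" [0, 0.08), "king" [0.08, 0.13), "of" [0.13, 0.17),
# so a draw of r = 0.09 lands in the "king" segment, as in the example.
print(multinomial_sample(probs))
```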
6. Sampling vs Search (Greedy / Beam)
It's worth noting:
- Sampling is random: it explores different possibilities.
- Greedy decoding chooses the highest-probability word at each step.
- Beam search keeps multiple best hypotheses at each time step.
Sampling is useful for analysis and creativity.
Search is better for production-level tasks (like translation or summarization).
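To make the contrast concrete, here is a small sketch of greedy decoding versus sampling at a single step (the probabilities are invented; beam search is omitted because it tracks several partial hypotheses rather than one choice):

```python
import random

# Invented distribution over the next word at one step.
next_word_probs = {"king": 0.5, "night": 0.3, "stage": 0.2}

# Greedy decoding: deterministically take the most probable word ("king").
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# Sampling: a random draw, so repeated runs can also yield "night" or "stage".
sampled_choice = random.choices(
    list(next_word_probs), weights=list(next_word_probs.values()), k=1
)[0]

print("greedy:", greedy_choice, "| sampled:", sampled_choice)
```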
7. Applications of Sampling
- Debugging language models
- Data augmentation (generating synthetic data)
- Creative writing (e.g., AI poetry)
- Dialogue systems (generating human-like responses)
- Entertainment (e.g., AI-generated Shakespeare, Harry Potter)
8. Summary: What Sampling Tells Us
| N-Gram Order | Sampling Output | Interpretation |
|---|---|---|
| Unigram | Word soup | Model knows word frequency, not structure |
| Bigram | Local coherence | One-step memory, weak sentence flow |
| Trigram | Fluent fragments | Phrases sound natural, but may not connect |
| 4-Gram | Almost real | Fluent but overfitting—copies from training |
Conclusion
Sampling sentences from a language model is like shining a flashlight into the model’s brain. It exposes its behavior in vivid, human-readable form—whether it stutters, loops, or surprises you with brilliance.
Next time you build an n-gram model, don’t just test it with numbers—talk to it. Let it speak. You’ll learn more from one sentence than from a hundred perplexity scores.