When we read a sentence, we naturally see it as a collection of words. But to a computer, a sentence is just one long sequence of characters, with no built-in sense of where one word ends and the next begins. To make sense of this, computers need to break text into smaller pieces called tokens. This process is called tokenization, and it’s the first step in almost every natural language processing (NLP) system.
So, what’s the difference between a word and a token?
A word is what we think of in everyday language — something you can look up in a dictionary, like dog or running. A token, however, is a unit that a computer uses when processing text. Usually, a token is a word, but not always! Sometimes, punctuation marks, parts of words, or even symbols like emojis are treated as separate tokens.
For example, the sentence “I’m happy! 🙂” might be split into five tokens: I, ’m, happy, !, and the emoji 🙂.
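To make the idea concrete, here is a tiny tokenizer sketched in Python. The function name and the regular expression are just mine for illustration; real NLP tools use much more careful rules (or vocabularies learned from data), so this only shows the general principle.

import re

# Toy tokenizer: pull out words, contractions like 'm, and any single
# punctuation mark or symbol (including emoji) as its own token.
def simple_tokenize(text):
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

print(simple_tokenize("I'm happy! 🙂"))
# ['I', "'m", 'happy', '!', '🙂']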
In English, separating words is fairly easy because we use spaces. But in other languages like Chinese or Thai, words are written together without spaces, so figuring out where one word ends and another begins can be much harder.
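You can see the problem by trying the space-splitting trick on a sentence written without spaces. The Chinese sentence below (my own example, meaning roughly “I love natural language processing”) comes back as one big chunk:

# Splitting on whitespace works for English but not for Chinese,
# because the sentence contains no spaces to split on.
sentence = "我爱自然语言处理"
print(sentence.split())       # ['我爱自然语言处理'] -- still one piece
print("I love NLP".split())   # ['I', 'love', 'NLP']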
Modern AI models, like ChatGPT, go even further. They don’t just split text into words — they often break words into subwords, smaller pieces that can be combined in many ways. For example, the word unhappiness might be broken into un, happi, and ness. This helps the AI understand and generate new words it has never seen before.
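Here is an equally small sketch of how a subword split could work, using a greedy longest-match over a made-up vocabulary. Real models learn tens of thousands of subword pieces from huge amounts of text (for example with byte pair encoding), so their actual splits may look different; this is only meant to show the idea.

# Toy subword splitter: repeatedly take the longest piece from the
# vocabulary that matches the start of what's left of the word.
VOCAB = {"un", "happi", "ness", "happy", "run", "ing"}

def split_subwords(word, vocab=VOCAB):
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # unknown character: keep it alone
            i += 1
    return pieces

print(split_subwords("unhappiness"))   # ['un', 'happi', 'ness']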
Understanding words and tokens is important because it’s how computers turn human language into data they can work with. Whether it’s translating a message, answering a question, or writing a poem, everything starts with the same simple step — breaking text into tokens.