WHY TOKENSNeural networks eat numbers, not letters
A model never sees the word learning — it sees an integer like 48066. The tokenizer is the contract between human text and the math: it splits text into subword pieces from a fixed vocabulary (typically 32k–256k entries) and replaces each piece with its ID. Everything downstream — embeddings, attention, the loss — operates on those IDs.
BPEByte-Pair Encoding: merge what co-occurs
BPE starts from raw bytes and repeatedly merges the most frequent adjacent pair into a new vocabulary entry. After ~50k merges, common words (the, network) are single tokens while rare words shatter: tokenization → token + ization. Example: GPT-style tokenizers encode "The neural network is learning." as 6 tokens, but the misspelled "Thee neurral netwrok" costs 12 — unfamiliar strings are expensive.
BYTE-LEVELByte-level BPE and SentencePiece never see 'unknown'
Working from 256 base bytes means any string — emoji, Korean, source code, base64 — can be tokenized; there is no out-of-vocabulary token. SentencePiece (used by Llama, T5) treats the space as a real symbol ▁, so it tokenizes raw text with no pre-splitting step and round-trips exactly.
COSTToken count is the bill you actually pay
Context windows, API pricing, and generation speed are all measured in tokens. English averages ~4 characters per token; code and non-Latin scripts run worse. A '128k context' is roughly 500 KB of English but far less Thai or Python. Rule of thumb: if you care about cost or latency, count tokens, not words.
CHAT TEMPLATESThe tokens you never typed
Every chat model wraps your messages in special tokens — <|im_start|>user, role markers, tool-call delimiters — defined by a chat template that ships in tokenizer_config.json. Use the wrong template and a perfectly good model babbles. This is the #1 practical tokenizer gotcha in 2026: when a downloaded model misbehaves, check the template before blaming the weights. Related curiosity: glitch tokens — vocabulary entries that barely appeared in training — can still produce bizarre outputs, and digit-splitting quirks are why models miscount the r's in strawberry.