Tokens — AI Learning Course · Macalinao Studio

§ 01

Core ideas

WHY TOKENS

Neural networks eat numbers, not letters

A model never sees the word learning — it sees an integer like 48066. The tokenizer is the contract between human text and the math: it splits text into subword pieces from a fixed vocabulary (typically 32k–256k entries) and replaces each piece with its ID. Everything downstream — embeddings, attention, the loss — operates on those IDs.

BPE

Byte-Pair Encoding: merge what co-occurs

BPE starts from raw bytes and repeatedly merges the most frequent adjacent pair into a new vocabulary entry. After ~50k merges, common words (the, network) are single tokens while rare words shatter: tokenization → token + ization. Example: GPT-style tokenizers encode "The neural network is learning." as 6 tokens, but the misspelled "Thee neurral netwrok" costs 12 — unfamiliar strings are expensive.

BYTE-LEVEL

Byte-level BPE and SentencePiece never see 'unknown'

Working from 256 base bytes means any string — emoji, Korean, source code, base64 — can be tokenized; there is no out-of-vocabulary token. SentencePiece (used by Llama, T5) treats the space as a real symbol ▁, so it tokenizes raw text with no pre-splitting step and round-trips exactly.

COST

Token count is the bill you actually pay

Context windows, API pricing, and generation speed are all measured in tokens. English averages ~4 characters per token; code and non-Latin scripts run worse. A '128k context' is roughly 500 KB of English but far less Thai or Python. Rule of thumb: if you care about cost or latency, count tokens, not words.

CHAT TEMPLATES

The tokens you never typed

Every chat model wraps your messages in special tokens — <|im_start|>user, role markers, tool-call delimiters — defined by a chat template that ships in tokenizer_config.json. Use the wrong template and a perfectly good model babbles. This is the #1 practical tokenizer gotcha in 2026: when a downloaded model misbehaves, check the template before blaming the weights. Related curiosity: glitch tokens — vocabulary entries that barely appeared in training — can still produce bizarre outputs, and digit-splitting quirks are why models miscount the r's in strawberry.

§ 02

The lesson

Layer 0: The Token Pipeline Before a neural network can process text, the language must be converted into math. Enter any sentence below to watch how Byte-Pair Encoding (BPE) shatters text into numeric tokens, and how the Embedding Table projects those integers into massive floating-point vectors.

3. Vector Embeddings (Lookup Table)

Each discrete Token ID looks up a dense row of floats representing semantic meaning. (Showing 8 of 4096 dimensions).

§ 03

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.

Playground · TokensOpen full screen ↗

§ 04

Tokenization, in detail.