Tokenization and NLP Fundamentals: BPE, SentencePiece, and WordPiece
Tokenization is the process of splitting text into sub-units (tokens) that can be processed by a model. This process, which forms the foundation of modern LLMs, directly affects model performance.
What is Tokenization?
Tokenization is the first step in converting raw text into numerical representations:
"Hello world!" → ["Hello", "world", "!"] → [1234, 5678, 99]
Tokenization Levels
- Character level: Every character is a token.
- Word level: Every word is a token.
- Subword level: Words are split into smaller sub-units (the modern approach).
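The three levels can be contrasted on one input. A minimal sketch; the subword split below is illustrative only, since a trained tokenizer learns its own splits:

```python
text = "unhappiness is temporary"

# Character level: every character (including spaces) is a token.
char_tokens = list(text)

# Word level: split on whitespace.
word_tokens = text.split()

# Subword level: illustrative split only -- a trained tokenizer
# learns these boundaries from data.
subword_tokens = ["un", "happi", "ness", "is", "tempor", "ary"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 24 3 6
```

Note the trade-off already visible here: the character level yields the longest sequences, the word level the shortest, and subwords sit in between.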
Word-Level Tokenization
Simple Approach
```python
def word_tokenize(text):
    return text.split()

# Example
text = "Artificial intelligence is shaping the future"
tokens = word_tokenize(text)
# ['Artificial', 'intelligence', 'is', 'shaping', 'the', 'future']
```
Problems
- OOV (Out of Vocabulary): Encountering words not seen during training.
- Large Vocabulary: Managing hundreds of thousands of words is inefficient.
- Morphological Richness: In languages like Turkish, the number of word variations due to suffixes is enormous.
- Compound Words: Determining if "Artificial intelligence" should be one concept or two.
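The OOV problem is easy to see with a toy word-level vocabulary (the entries and IDs here are hypothetical):

```python
# Hypothetical fixed word-level vocabulary; a real one holds
# tens or hundreds of thousands of entries.
vocab = {"<UNK>": 0, "artificial": 1, "intelligence": 2, "is": 3, "the": 4}

def encode_words(text, vocab):
    # Any word missing from the vocabulary collapses to <UNK>,
    # losing its meaning entirely.
    return [vocab.get(w.lower(), vocab["<UNK>"]) for w in text.split()]

print(encode_words("Artificial intelligence is transformative", vocab))
# [1, 2, 3, 0] -- "transformative" became <UNK>
```

Subword tokenizers avoid this collapse: an unseen word is split into known pieces instead of being mapped to a single unknown token.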
Character-Level Tokenization
```python
def char_tokenize(text):
    return list(text)

# Example
text = "Hello"
tokens = char_tokenize(text)
# ['H', 'e', 'l', 'l', 'o']
```
Advantages
- No OOV problem.
- Small vocabulary size (~100 characters).
Disadvantages
- Resulting sequences are very long.
- Loss of contextual meaning at the token level.
- Higher computational cost for the model.
Subword Tokenization
The choice of modern LLMs: A balance between word and character levels.
"tokenization" → ["token", "ization"] "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
BPE (Byte Pair Encoding)
The most widely used subword tokenization algorithm.
BPE Algorithm
- Split text into individual characters.
- Find the most frequent pair of adjacent characters.
- Merge this pair into a new single token.
- Repeat this process until the desired vocabulary size is reached.
BPE Example
```
Starting vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd']
Corpus: "low lower newest lowest widest"

Step 1: Most frequent pair 'e' + 's' → 'es'
Step 2: Most frequent pair 'es' + 't' → 'est'
Step 3: Most frequent pair 'l' + 'o' → 'lo'
Step 4: Most frequent pair 'lo' + 'w' → 'low'
...

Final result: ['low', 'est', 'er', 'new', 'wid', ...]
```
BPE Implementation
```python
from collections import Counter

def get_initial_vocab(corpus):
    # Represent each word as a space-separated character sequence,
    # keyed to its frequency in the corpus.
    counts = Counter(corpus.split())
    return {' '.join(word): freq for word, freq in counts.items()}

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair.
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair into a single symbol.
    # (Simplified: a production implementation matches whole symbols only.)
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def train_bpe(corpus, num_merges):
    vocab = get_initial_vocab(corpus)
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
    return vocab
```
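Training learns the vocabulary, but tokenizing new text requires replaying the merges in the order they were learned. A minimal sketch of that step; `bpe_encode` and the ordered `merges` list below are assumptions for illustration, not part of the training code above:

```python
def bpe_encode(word, merges):
    # Start from individual characters, then apply each learned merge
    # in training order, left to right.
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

# Hypothetical merge list, in the order it was learned
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est']
print(bpe_encode("newest", merges))  # ['n', 'e', 'w', 'est']
```

Any word can be encoded this way, even one never seen during training: unknown parts simply remain as smaller pieces or single characters.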
WordPiece
An algorithm developed by Google and used in models like BERT.
BPE vs WordPiece
| Feature | BPE | WordPiece |
|---|---|---|
| Merge Criterion | Frequency | Likelihood |
| Prefix | None | ## (for mid-word tokens) |
| Used In | GPT, LLaMA | BERT, DistilBERT |
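The likelihood criterion in the table can be made concrete: WordPiece merges the pair that maximizes `freq(pair) / (freq(left) × freq(right))`, i.e. pair frequency normalized by how common the parts are on their own. A sketch with hypothetical counts:

```python
def wordpiece_score(pair_freq, left_freq, right_freq):
    # WordPiece picks the merge that most increases corpus likelihood,
    # which reduces to pair frequency normalized by the parts' frequencies.
    return pair_freq / (left_freq * right_freq)

# Hypothetical counts: 'e'+'s' co-occurs often, but 'e' and 's' are
# ubiquitous on their own, so a rarer but more exclusive pair wins
# under WordPiece where it would lose under plain BPE frequency.
common_parts = wordpiece_score(pair_freq=100, left_freq=1000, right_freq=800)
exclusive_pair = wordpiece_score(pair_freq=30, left_freq=40, right_freq=50)
print(common_parts, exclusive_pair)  # 0.000125 0.015
```

Under pure BPE the first pair (frequency 100) would be merged first; under WordPiece the second one is preferred.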
WordPiece Example
"tokenization" → ["token", "##ization"] "playing" → ["play", "##ing"]
SentencePiece
A language-agnostic tokenizer also developed by Google.
Features
- Language Independent: Does not assume whitespace is a word separator.
- Raw-text processing: Operates directly on raw text without a pre-tokenization step (with optional byte fallback for unseen characters).
- BPE + Unigram: Supports multiple algorithms.
- Reversible: Perfect detokenization is possible.
SentencePiece Usage
```python
import sentencepiece as spm

# Train the model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

# Load and use the model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

# Encode
tokens = sp.encode('Hello world', out_type=str)
# ['▁Hello', '▁world']

ids = sp.encode('Hello world', out_type=int)
# e.g. [1234, 5678]  (IDs depend on the trained model)

# Decode
text = sp.decode(ids)
# 'Hello world'
```
The ▁ (Word-Start Marker) Symbol
SentencePiece marks the start of words with ▁:
"Hello world" → ["▁Hello", "▁world"] "New York" → ["▁New", "▁York"]
Tiktoken (OpenAI)
The specialized BPE implementation used by OpenAI.
```python
import tiktoken

# Load the encoder for a given model
enc = tiktoken.encoding_for_model("gpt-4")

# Encode
tokens = enc.encode("Hello world!")
# e.g. [12345, 67890, 999]  (illustrative IDs)

# Decode
text = enc.decode(tokens)
# "Hello world!"

# Check token count
print(len(tokens))  # 3
```
Model-Encoder Mappings
| Model | Encoder | Vocab Size |
|---|---|---|
| GPT-4 | cl100k_base | 100,277 |
| GPT-3.5 | cl100k_base | 100,277 |
| GPT-3 | p50k_base | 50,281 |
| Codex | p50k_edit | 50,281 |
Hugging Face Tokenizers
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode
encoded = tokenizer("Hello, world!", return_tensors="pt")
# {
#   'input_ids': tensor([[101, 7592, 1010, 2088, 999, 102]]),
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
# }

# Decode
text = tokenizer.decode(encoded['input_ids'][0])
# "[CLS] hello, world! [SEP]"

# Token list
tokens = tokenizer.tokenize("Hello, world!")
# ['hello', ',', 'world', '!']
```
Fast Tokenizers
```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Create a new tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")
```
Special Tokens
Common Special Tokens
| Token | Description | Use Case |
|---|---|---|
| [CLS] | Start of sequence | BERT classification tasks |
| [SEP] | Segment separator | Separating sentence pairs |
| [PAD] | Padding | Batch processing alignment |
| [UNK] | Unknown token | Handling out-of-vocabulary words |
| [MASK] | Mask | Masked Language Modeling (MLM) |
| <|endoftext|> | End of sequence | GPT Generative tasks |
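The [PAD] token's role in batch alignment can be sketched in a few lines: shorter sequences are padded to the batch maximum, and an attention mask tells the model which positions are real. The token IDs below follow BERT's conventions (101 = [CLS], 102 = [SEP], 0 = [PAD]), used here purely for illustration:

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence to the batch max length, and build an
    # attention mask marking real tokens (1) vs padding (0).
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0], [101, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

In practice libraries handle this automatically (e.g. `padding=True` in Hugging Face tokenizers), but the underlying mechanism is exactly this.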
Chat Tokens
```
<|system|>You are a helpful assistant<|end|>
<|user|>Hello!<|end|>
<|assistant|>Hello! How can I help you today?<|end|>
```
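Chat templates like the one above are ordinary strings assembled from role-tagged messages before tokenization. A sketch using this hypothetical template; real models each define their own format, usually applied through the tokenizer (e.g. `apply_chat_template` in Hugging Face):

```python
def build_chat_prompt(messages):
    # Wrap each message in role markers; the <|...|> strings are
    # registered as special tokens so they encode as single IDs.
    return "\n".join(
        f"<|{m['role']}|>{m['content']}<|end|>" for m in messages
    )

prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
# <|system|>You are a helpful assistant<|end|>
# <|user|>Hello!<|end|>
```

The model then generates the assistant turn as a continuation of this string, stopping at its end-of-turn token.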
Tokenization Challenges in Turkish
Morphological Richness
1"gelebileceklermiş" (they were said to be able to come) → A single word but complex structure 2gel (come) + ebil (can) + ecek (will) + ler (they) + miş (reportedly) 3 4Tokenization: 5- Poor: ["gelebileceklermiş"] (Single token, very rare) 6- Good: ["gel", "ebil", "ecek", "ler", "miş"]
Solutions
- Turkish-optimized tokenizer training.
- Integration of morphological analysis.
- Suffix-aware BPE application.
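A suffix-aware split can be sketched as a greedy peel from the end of the word. The suffix list here is a tiny hypothetical inventory; a real system would rely on a morphological analyzer rather than a fixed list:

```python
def peel_suffixes(word, suffixes):
    # Greedily strip known suffixes from the end of the word,
    # collecting them in surface order.
    pieces = []
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                pieces.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + pieces

print(peel_suffixes("gelebileceklermiş", ["miş", "ler", "ecek", "ebil"]))
# ['gel', 'ebil', 'ecek', 'ler', 'miş']
```

A tokenizer trained on splits like these maps Turkish suffixes to stable tokens instead of fragmenting them differently in every word.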
Token Limits and Management
Context Window
| Model | Context Length (Tokens) | ~Word Equivalent |
|---|---|---|
| GPT-3.5 | 16K | ~12,000 |
| GPT-4 | 128K | ~96,000 |
| Claude 3 | 200K | ~150,000 |
Token Count Estimation
```python
import tiktoken

def estimate_tokens(text, chars_per_token=3):
    # Rough estimate: 1 token ≈ 4 characters for English,
    # closer to 3 for morphologically rich languages like Turkish.
    return len(text) // chars_per_token

# More accurate count via the model's actual encoder
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
```
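Character-based estimates are often enough for a cheap pre-flight check before sending text to a model. A sketch of such a check; the 4-characters-per-token ratio is a heuristic, not a guarantee, and `fits_context` is an illustrative helper, not a library function:

```python
def fits_context(text, max_tokens, chars_per_token=4):
    # Rough pre-check before an expensive API call or exact count;
    # use a smaller ratio (~3) for morphologically rich languages.
    return len(text) // chars_per_token <= max_tokens

long_text = "word " * 1000  # 5,000 characters ≈ 1,250 tokens
print(fits_context(long_text, max_tokens=2000))  # True
print(fits_context(long_text, max_tokens=1000))  # False
```

When the estimate lands near the limit, fall back to an exact count with the model's real encoder before deciding to truncate.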
Conclusion
Tokenization is the fundamental building block of NLP and LLMs. Subword methods like BPE, WordPiece, and SentencePiece play a critical role in the success of modern language models. Choosing and configuring the right tokenizer directly impacts the final performance of the model.
At Veni AI, we develop tokenization strategies tailored to Turkish NLP solutions.
