
Tokenization and NLP Fundamentals: BPE, SentencePiece, and WordPiece

Tokenization methods in natural language processing, BPE, SentencePiece, WordPiece algorithms, and modern LLM tokenizer architectures.

Veni AI Technical Team · January 7, 2025 · 6 min read

Tokenization is the process of splitting text into sub-units (tokens) that can be processed by a model. This process, which forms the foundation of modern LLMs, directly affects model performance.

What is Tokenization?

Tokenization is the first step in converting raw text into numerical representations:

"Hello world!" → ["Hello", "world", "!"] → [1234, 5678, 99]

Tokenization Levels

  1. Character level: Every character is a token.
  2. Word level: Every word is a token.
  3. Subword level: Words are split into smaller sub-units (the modern approach).
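The three levels can be illustrated in a few lines of Python (the subword split here is hand-written for illustration; a real tokenizer learns its splits from data):

```python
text = "unhappiness"

# Character level: every character is a token
char_tokens = list(text)
# ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's']

# Word level: every whitespace-separated word is a token
word_tokens = "I felt unhappiness".split()
# ['I', 'felt', 'unhappiness']

# Subword level: words split into learned sub-units (illustrative split)
subword_tokens = ["un", "happi", "ness"]
```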

Word-Level Tokenization

Simple Approach

```python
def word_tokenize(text):
    return text.split()

# Example
text = "Artificial intelligence is shaping the future"
tokens = word_tokenize(text)
# ['Artificial', 'intelligence', 'is', 'shaping', 'the', 'future']
```

Problems

  1. OOV (Out of Vocabulary): Encountering words not seen during training.
  2. Large Vocabulary: Managing hundreds of thousands of words is inefficient.
  3. Morphological Richness: In languages like Turkish, the number of word variations due to suffixes is enormous.
  4. Compound Words: Determining if "Artificial intelligence" should be one concept or two.
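The OOV problem can be sketched in a few lines: a fixed word-level vocabulary must map every unseen word to a single `<UNK>` id, losing information. The vocabulary and ids below are made up for illustration:

```python
vocab = {"<UNK>": 0, "artificial": 1, "intelligence": 2, "is": 3, "the": 4, "future": 5}

def encode_word_level(text):
    # Any word not in the vocabulary collapses to <UNK> (id 0)
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(encode_word_level("artificial intelligence is the future"))
# [1, 2, 3, 4, 5]
print(encode_word_level("artificial superintelligence is coming"))
# [1, 0, 3, 0] -- two distinct unseen words become the same token
```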

Character-Level Tokenization

```python
def char_tokenize(text):
    return list(text)

# Example
text = "Hello"
tokens = char_tokenize(text)
# ['H', 'e', 'l', 'l', 'o']
```

Advantages

  • No OOV problem.
  • Small vocabulary size (~100 characters).

Disadvantages

  • Resulting sequences are very long.
  • Loss of contextual meaning at the token level.
  • Higher computational cost for the model.
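The sequence-length cost is easy to quantify: for the same sentence, character-level tokenization produces several times more tokens than word-level, and transformer attention cost grows quadratically with sequence length:

```python
text = "Tokenization is the foundation of modern language models"

char_len = len(list(text))    # one token per character, including spaces
word_len = len(text.split())  # one token per word

print(char_len, word_len)  # character sequence is ~7x longer here
```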

Subword Tokenization

The approach used by modern LLMs: a balance between the word and character levels.

"tokenization" → ["token", "ization"]
"unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]

BPE (Byte Pair Encoding)

The most widely used subword tokenization algorithm.

BPE Algorithm

  1. Split text into individual characters.
  2. Find the most frequent pair of adjacent characters.
  3. Merge this pair into a new single token.
  4. Repeat this process until the desired vocabulary size is reached.

BPE Example

```
Starting vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd']
Corpus: "low lower newest lowest widest"

Step 1: Most frequent pair 'e' + 's' → 'es'
Step 2: Most frequent pair 'es' + 't' → 'est'
Step 3: Most frequent pair 'l' + 'o' → 'lo'
Step 4: Most frequent pair 'lo' + 'w' → 'low'
...

Final Result: ['low', 'est', 'er', 'new', 'wid', ...]
```

BPE Implementation

```python
def get_initial_vocab(corpus):
    # Each word becomes a space-separated sequence of characters,
    # mapped to its frequency in the corpus
    vocab = {}
    for word in corpus.split():
        key = ' '.join(word)
        vocab[key] = vocab.get(key, 0) + 1
    return vocab

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair into a single symbol
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def train_bpe(corpus, num_merges):
    vocab = get_initial_vocab(corpus)
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
    return vocab
```

WordPiece

An algorithm developed by Google and used in models like BERT.

BPE vs WordPiece

| Feature | BPE | WordPiece |
| --- | --- | --- |
| Merge Criterion | Frequency | Likelihood |
| Prefix | None | ## (for mid-word tokens) |
| Used In | GPT, LLaMA | BERT, DistilBERT |
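The likelihood criterion can be sketched concretely: instead of merging the most frequent pair, WordPiece merges the pair that maximizes count(pair) / (count(left) × count(right)), which favors pairs whose parts rarely occur apart. The counts below are toy numbers for illustration:

```python
def wordpiece_score(pair_count, left_count, right_count):
    # WordPiece merge score: pair frequency normalized by the
    # frequencies of its two parts
    return pair_count / (left_count * right_count)

# ('e', 's') is the more frequent pair, but 'e' and 's' are common everywhere;
# ('q', 'u') almost always occur together, so it wins under WordPiece
es = wordpiece_score(pair_count=90, left_count=500, right_count=400)
qu = wordpiece_score(pair_count=50, left_count=55, right_count=52)

print(es < qu)  # True: WordPiece prefers the ('q', 'u') merge
```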

WordPiece Example

"tokenization" → ["token", "##ization"]
"playing" → ["play", "##ing"]
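Encoding a word with a trained WordPiece vocabulary uses greedy longest-match-first: repeatedly take the longest prefix found in the vocabulary, marking non-initial pieces with ##. A minimal sketch (the vocabulary is hand-built for illustration):

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first WordPiece encoding of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the window until the (possibly ##-prefixed) piece is in vocab
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched at this position
            return ["[UNK]"]
        start = end
    return tokens

vocab = {"play", "token", "##ing", "##ization"}
print(wordpiece_encode("playing", vocab))       # ['play', '##ing']
print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
```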

SentencePiece

A language-agnostic tokenizer also developed by Google.

Features

  • Language Independent: Does not assume whitespace is a word separator.
  • Raw-text operation: Trains directly on raw sentences, with no separate pre-tokenization step (an optional byte fallback handles unseen characters).
  • BPE + Unigram: Supports multiple algorithms.
  • Reversible: Perfect detokenization is possible.

SentencePiece Usage

```python
import sentencepiece as spm

# Training the model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

# Loading and using the model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

# Encode
tokens = sp.encode('Hello world', out_type=str)
# ['▁Hello', '▁world']

ids = sp.encode('Hello world', out_type=int)
# [1234, 5678]

# Decode
text = sp.decode(ids)
# 'Hello world'
```

▁ (Underscore) Symbol

SentencePiece marks the start of each word with the ▁ (U+2581) symbol:

"Hello world" → ["▁Hello", "▁world"]
"New York" → ["▁New", "▁York"]
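Because the ▁ marker stores word boundaries inside the tokens themselves, detokenization is a pure string operation, which is what makes SentencePiece lossless:

```python
tokens = ["▁New", "▁York"]

# Perfect detokenization: concatenate, then turn ▁ back into spaces
text = "".join(tokens).replace("▁", " ").strip()
print(text)  # 'New York'
```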

Tiktoken (OpenAI)

The specialized BPE implementation used by OpenAI.

```python
import tiktoken

# Loading the encoder
enc = tiktoken.encoding_for_model("gpt-4")

# Encode
tokens = enc.encode("Hello world!")
# [12345, 67890, 999]

# Decode
text = enc.decode(tokens)
# "Hello world!"

# Check token count
print(len(tokens))  # 3
```

Model-Encoder Mappings

| Model | Encoder | Vocab Size |
| --- | --- | --- |
| GPT-4 | cl100k_base | 100,277 |
| GPT-3.5 | cl100k_base | 100,277 |
| GPT-3 | p50k_base | 50,281 |
| Codex | p50k_edit | 50,281 |

Hugging Face Tokenizers

```python
from transformers import AutoTokenizer

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode
encoded = tokenizer("Hello, world!", return_tensors="pt")
# {
#     'input_ids': tensor([[101, 7592, 1010, 2088, 999, 102]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
# }

# Decode
text = tokenizer.decode(encoded['input_ids'][0])
# "[CLS] hello, world! [SEP]"

# Token list
tokens = tokenizer.tokenize("Hello, world!")
# ['hello', ',', 'world', '!']
```

Fast Tokenizers

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Creating a new tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")
```

Special Tokens

Common Special Tokens

| Token | Description | Use Case |
| --- | --- | --- |
| [CLS] | Start of sequence | BERT classification tasks |
| [SEP] | Segment separator | Separating sentence pairs |
| [PAD] | Padding | Batch processing alignment |
| [UNK] | Unknown token | Handling out-of-vocabulary words |
| [MASK] | Mask | Masked Language Modeling (MLM) |
| <\|endoftext\|> | End of sequence | GPT generative tasks |
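A quick sketch of why [PAD] matters: sequences in a batch have different lengths, so shorter ones are padded to the longest, and an attention mask records which positions are real tokens. The token ids below are made up for illustration:

```python
PAD_ID = 0
batch = [[101, 7592, 102], [101, 7592, 1010, 2088, 102]]

max_len = max(len(seq) for seq in batch)
# Append PAD ids to the short sequence; mask is 1 for real tokens, 0 for padding
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[101, 7592, 102, 0, 0], [101, 7592, 1010, 2088, 102]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```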

Chat Tokens

```
<|system|>You are a helpful assistant<|end|>
<|user|>Hello!<|end|>
<|assistant|>Hello! How can I help you today?<|end|>
```
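Rendering such a template is plain string formatting; a minimal sketch using the markers from the example above (real models ship their own chat templates, e.g. via `apply_chat_template` in Hugging Face Transformers):

```python
def render_chat(messages):
    # Wrap each message in role and end markers, matching the format above
    return "".join(
        f"<|{m['role']}|>{m['content']}<|end|>" for m in messages
    )

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
]
print(render_chat(messages))
# <|system|>You are a helpful assistant<|end|><|user|>Hello!<|end|>
```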

Tokenization Challenges in Turkish

Morphological Richness

```
"gelebileceklermiş" (they were said to be able to come) → a single word with a complex structure
gel (come) + ebil (can) + ecek (will) + ler (they) + miş (reportedly)

Tokenization:
- Poor: ["gelebileceklermiş"] (single, very rare token)
- Good: ["gel", "ebil", "ecek", "ler", "miş"]
```

Solutions

  1. Turkish-optimized tokenizer training.
  2. Integration of morphological analysis.
  3. Suffix-aware BPE application.

Token Limits and Management

Context Window

| Model | Context Length (Tokens) | ~Word Equivalent |
| --- | --- | --- |
| GPT-3.5 | 16K | ~12,000 |
| GPT-4 | 128K | ~96,000 |
| Claude 3 | 200K | ~150,000 |

Token Count Estimation

```python
import tiktoken

def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 4 characters for English,
    # ≈ 3 characters for Turkish (the Turkish ratio is used here)
    return len(text) // 3

# More accurate calculation
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
```

Conclusion

Tokenization is the fundamental building block of NLP and LLMs. Subword methods like BPE, WordPiece, and SentencePiece play a critical role in the success of modern language models. Choosing and configuring the right tokenizer directly impacts the final performance of the model.

At Veni AI, we develop tokenization strategies specialized for Turkish as part of our NLP solutions.
