Tokenization and NLP Fundamentals: BPE, SentencePiece, and WordPiece
Tokenization is the process of splitting text into sub-units (tokens) that can be processed by a model. This process, which forms the foundation of modern LLMs, directly affects model performance.
What is Tokenization?
Tokenization is the first step in converting raw text into numerical representations:
"Hello world!" → ["Hello", "world", "!"] → [1234, 5678, 99]
Tokenization Levels
- Character level: Every character is a token.
- Word level: Every word is a token.
- Subword level: Words are split into smaller sub-units (the modern approach).
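The three levels can be contrasted on one input. A minimal sketch; the subword split below is illustrative only, since a trained tokenizer learns its own splits:

```python
text = "unhappiness is temporary"

# Character level: every character (including spaces) is a token.
char_tokens = list(text)

# Word level: split on whitespace.
word_tokens = text.split()

# Subword level: illustrative split only -- a trained tokenizer
# learns these boundaries from data.
subword_tokens = ["un", "happi", "ness", "is", "tempor", "ary"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 24 3 6
```

Note the trade-off already visible here: the character level yields the longest sequences, the word level the shortest, and subwords sit in between.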
Word-Level Tokenization
Simple Approach
```python
def word_tokenize(text):
    return text.split()

# Example
text = "Artificial intelligence is shaping the future"
tokens = word_tokenize(text)
# ['Artificial', 'intelligence', 'is', 'shaping', 'the', 'future']
```
Problems
- OOV (Out of Vocabulary): Encountering words not seen during training.
- Large Vocabulary: Managing hundreds of thousands of words is inefficient.
- Morphological Richness: In languages like Turkish, the number of word variations due to suffixes is enormous.
- Compound Words: Determining if "Artificial intelligence" should be one concept or two.
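The OOV problem is easy to see with a toy word-level vocabulary (the entries and IDs here are hypothetical):

```python
# Hypothetical fixed word-level vocabulary; a real one holds
# tens or hundreds of thousands of entries.
vocab = {"<UNK>": 0, "artificial": 1, "intelligence": 2, "is": 3, "the": 4}

def encode_words(text, vocab):
    # Any word missing from the vocabulary collapses to <UNK>,
    # losing its meaning entirely.
    return [vocab.get(w.lower(), vocab["<UNK>"]) for w in text.split()]

print(encode_words("Artificial intelligence is transformative", vocab))
# [1, 2, 3, 0] -- "transformative" became <UNK>
```

Subword tokenizers avoid this collapse: an unseen word is split into known pieces instead of being mapped to a single unknown token.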
Character-Level Tokenization
```python
def char_tokenize(text):
    return list(text)

# Example
text = "Hello"
tokens = char_tokenize(text)
# ['H', 'e', 'l', 'l', 'o']
```
Advantages
- No OOV problem.
- Small vocabulary size (~100 characters).
Disadvantages
- Resulting sequences are very long.
- Loss of contextual meaning at the token level.
- Higher computational cost for the model.
Subword Tokenization
The choice of modern LLMs: A balance between word and character levels.
"tokenization" → ["token", "ization"] "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
BPE (Byte Pair Encoding)
The most widely used subword tokenization algorithm.
BPE Algorithm
- Split text into individual characters.
- Find the most frequent pair of adjacent characters.
- Merge this pair into a new single token.
- Repeat this process until the desired vocabulary size is reached.
BPE Example
```
Starting vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd']
Corpus: "low lower newest lowest widest"

Step 1: Most frequent pair 'e' + 's' → 'es'
Step 2: Most frequent pair 'es' + 't' → 'est'
Step 3: Most frequent pair 'l' + 'o' → 'lo'
Step 4: Most frequent pair 'lo' + 'w' → 'low'
...

Final result: ['low', 'est', 'er', 'new', 'wid', ...]
```
BPE Implementation
```python
from collections import Counter

def get_initial_vocab(corpus):
    # Represent each word as a space-separated character sequence,
    # keyed to its frequency in the corpus.
    counts = Counter(corpus.split())
    return {' '.join(word): freq for word, freq in counts.items()}

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair.
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair into a single symbol.
    # (Simplified: a production implementation matches whole symbols only.)
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def train_bpe(corpus, num_merges):
    vocab = get_initial_vocab(corpus)
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
    return vocab
```
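Training learns the vocabulary, but tokenizing new text requires replaying the merges in the order they were learned. A minimal sketch of that step; `bpe_encode` and the ordered `merges` list below are assumptions for illustration, not part of the training code above:

```python
def bpe_encode(word, merges):
    # Start from individual characters, then apply each learned merge
    # in training order, left to right.
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

# Hypothetical merge list, in the order it was learned
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est']
print(bpe_encode("newest", merges))  # ['n', 'e', 'w', 'est']
```

Any word can be encoded this way, even one never seen during training: unknown parts simply remain as smaller pieces or single characters.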
WordPiece
An algorithm developed by Google and used in models like BERT.
BPE vs WordPiece
| Feature | BPE | WordPiece |
|---|---|---|
| Merge Criterion | Frequency | Likelihood |
| Prefix | None | ## (for mid-word tokens) |
| Used In | GPT, LLaMA | BERT, DistilBERT |
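The likelihood criterion in the table can be made concrete: WordPiece merges the pair that maximizes `freq(pair) / (freq(left) × freq(right))`, i.e. pair frequency normalized by how common the parts are on their own. A sketch with hypothetical counts:

```python
def wordpiece_score(pair_freq, left_freq, right_freq):
    # WordPiece picks the merge that most increases corpus likelihood,
    # which reduces to pair frequency normalized by the parts' frequencies.
    return pair_freq / (left_freq * right_freq)

# Hypothetical counts: 'e'+'s' co-occurs often, but 'e' and 's' are
# ubiquitous on their own, so a rarer but more exclusive pair wins
# under WordPiece where it would lose under plain BPE frequency.
common_parts = wordpiece_score(pair_freq=100, left_freq=1000, right_freq=800)
exclusive_pair = wordpiece_score(pair_freq=30, left_freq=40, right_freq=50)
print(common_parts, exclusive_pair)  # 0.000125 0.015
```

Under pure BPE the first pair (frequency 100) would be merged first; under WordPiece the second one is preferred.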
WordPiece Example
"tokenization" → ["token", "##ization"] "playing" → ["play", "##ing"]
SentencePiece
A language-agnostic tokenizer also developed by Google.
Features
- Language Independent: Does not assume whitespace is a word separator.
- Raw-text processing: Operates directly on raw text without a pre-tokenization step (with optional byte fallback for unseen characters).
- BPE + Unigram: Supports multiple algorithms.
- Reversible: Perfect detokenization is possible.
SentencePiece Usage
```python
import sentencepiece as spm

# Train the model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

# Load and use the model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

# Encode
tokens = sp.encode('Hello world', out_type=str)
# ['▁Hello', '▁world']

ids = sp.encode('Hello world', out_type=int)
# e.g. [1234, 5678]  (IDs depend on the trained model)

# Decode
text = sp.decode(ids)
# 'Hello world'
```
The ▁ (Word-Start Marker) Symbol
SentencePiece marks the start of words with ▁:
"Hello world" → ["▁Hello", "▁world"] "New York" → ["▁New", "▁York"]
Tiktoken (OpenAI)
The specialized BPE implementation used by OpenAI.
```python
import tiktoken

# Load the encoder for a given model
enc = tiktoken.encoding_for_model("gpt-4")

# Encode
tokens = enc.encode("Hello world!")
# e.g. [12345, 67890, 999]  (illustrative IDs)

# Decode
text = enc.decode(tokens)
# "Hello world!"

# Check token count
print(len(tokens))  # 3
```
Model-Encoder Mappings
| Model | Encoder | Vocab Size |
|---|---|---|
| GPT-4 | cl100k_base | 100,277 |
| GPT-3.5 | cl100k_base | 100,277 |
| GPT-3 | p50k_base | 50,281 |
| Codex | p50k_edit | 50,281 |
Hugging Face Tokenizers
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode
encoded = tokenizer("Hello, world!", return_tensors="pt")
# {
#   'input_ids': tensor([[101, 7592, 1010, 2088, 999, 102]]),
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
# }

# Decode
text = tokenizer.decode(encoded['input_ids'][0])
# "[CLS] hello, world! [SEP]"

# Token list
tokens = tokenizer.tokenize("Hello, world!")
# ['hello', ',', 'world', '!']
```
Fast Tokenizers
```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Create a new tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")
```
Special Tokens
Common Special Tokens
| Token | Description | Use Case |
|---|---|---|
| [CLS] | Start of sequence | BERT classification tasks |
| [SEP] | Segment separator | Separating sentence pairs |
| [PAD] | Padding | Batch processing alignment |
| [UNK] | Unknown token | Handling out-of-vocabulary words |
| [MASK] | Mask | Masked Language Modeling (MLM) |
| <|endoftext|> | End of sequence | GPT Generative tasks |
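The [PAD] token's role in batch alignment can be sketched in a few lines: shorter sequences are padded to the batch maximum, and an attention mask tells the model which positions are real. The token IDs below follow BERT's conventions (101 = [CLS], 102 = [SEP], 0 = [PAD]), used here purely for illustration:

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence to the batch max length, and build an
    # attention mask marking real tokens (1) vs padding (0).
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0], [101, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

In practice libraries handle this automatically (e.g. `padding=True` in Hugging Face tokenizers), but the underlying mechanism is exactly this.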
Chat Tokens
```
<|system|>You are a helpful assistant<|end|>
<|user|>Hello!<|end|>
<|assistant|>Hello! How can I help you today?<|end|>
```
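Chat templates like the one above are ordinary strings assembled from role-tagged messages before tokenization. A sketch using this hypothetical template; real models each define their own format, usually applied through the tokenizer (e.g. `apply_chat_template` in Hugging Face):

```python
def build_chat_prompt(messages):
    # Wrap each message in role markers; the <|...|> strings are
    # registered as special tokens so they encode as single IDs.
    return "\n".join(
        f"<|{m['role']}|>{m['content']}<|end|>" for m in messages
    )

prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
# <|system|>You are a helpful assistant<|end|>
# <|user|>Hello!<|end|>
```

The model then generates the assistant turn as a continuation of this string, stopping at its end-of-turn token.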
Tokenization Challenges in Turkish
Morphological Richness
1"gelebileceklermiş" (they were said to be able to come) → A single word but complex structure 2gel (come) + ebil (can) + ecek (will) + ler (they) + miş (reportedly) 3 4Tokenization: 5- Poor: ["gelebileceklermiş"] (Single token, very rare) 6- Good: ["gel", "ebil", "ecek", "ler", "miş"]
Solutions
- Turkish-optimized tokenizer training.
- Integration of morphological analysis.
- Suffix-aware BPE application.
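A suffix-aware split can be sketched as a greedy peel from the end of the word. The suffix list here is a tiny hypothetical inventory; a real system would rely on a morphological analyzer rather than a fixed list:

```python
def peel_suffixes(word, suffixes):
    # Greedily strip known suffixes from the end of the word,
    # collecting them in surface order.
    pieces = []
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                pieces.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + pieces

print(peel_suffixes("gelebileceklermiş", ["miş", "ler", "ecek", "ebil"]))
# ['gel', 'ebil', 'ecek', 'ler', 'miş']
```

A tokenizer trained on splits like these maps Turkish suffixes to stable tokens instead of fragmenting them differently in every word.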
Token Limits and Management
Context Window
| Model | Context Length (Tokens) | ~Word Equivalent |
|---|---|---|
| GPT-3.5 | 16K | ~12,000 |
| GPT-4 | 128K | ~96,000 |
| Claude 3 | 200K | ~150,000 |
Token Count Estimation
```python
import tiktoken

def estimate_tokens(text, chars_per_token=3):
    # Rough estimate: 1 token ≈ 4 characters for English,
    # closer to 3 for morphologically rich languages like Turkish.
    return len(text) // chars_per_token

# More accurate count via the model's actual encoder
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
```
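Character-based estimates are often enough for a cheap pre-flight check before sending text to a model. A sketch of such a check; the 4-characters-per-token ratio is a heuristic, not a guarantee, and `fits_context` is an illustrative helper, not a library function:

```python
def fits_context(text, max_tokens, chars_per_token=4):
    # Rough pre-check before an expensive API call or exact count;
    # use a smaller ratio (~3) for morphologically rich languages.
    return len(text) // chars_per_token <= max_tokens

long_text = "word " * 1000  # 5,000 characters ≈ 1,250 tokens
print(fits_context(long_text, max_tokens=2000))  # True
print(fits_context(long_text, max_tokens=1000))  # False
```

When the estimate lands near the limit, fall back to an exact count with the model's real encoder before deciding to truncate.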
Conclusion
Tokenization is the fundamental building block of NLP and LLMs. Subword methods like BPE, WordPiece, and SentencePiece play a critical role in the success of modern language models. Choosing and configuring the right tokenizer directly impacts the final performance of the model.
At Veni AI, we develop tokenization strategies tailored to Turkish NLP solutions.
