Field	Value	Source
Canonical Path	/blog/tokenization-nlp-temelleri-bpe-sentencepiece	Veni AI Blog
Primary Category	NLP	Post Metadata
Author	Veni AI Technical Team	Post Metadata

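Tokenization and NLP Fundamentals: BPE, SentencePiece, WordPiece

Tokenization is the process of splitting text into subunits (tokens) that a model can process. It underpins modern LLMs and directly affects model performance.

What Is Tokenization?

Tokenization is the first step in converting raw text into a numerical representation.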

"Hello world!" → ["Hello", "world", "!"] → [1234, 5678, 99]

Levels of Tokenization

Character level: every character is a token.
Word level: every word is a token.
Subword level: words are split into smaller units (the dominant approach today).
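
To make the three levels concrete, the sketch below tokenizes the same string at each level and then maps the word-level tokens to integer IDs. The subword split is written by hand and the ID assignment is hypothetical; a trained tokenizer would produce its own pieces and IDs.

```python
import re

text = "Hello world!"

# Character level: every character (including the space) is a token.
char_tokens = list(text)

# Word level: split into words and punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Subword level: a hand-written segmentation standing in for a trained tokenizer.
subword_tokens = ["Hel", "lo", "world", "!"]

# Map tokens to integer IDs with a vocabulary built on the fly (hypothetical IDs).
vocab = {token: idx for idx, token in enumerate(sorted(set(word_tokens)))}
word_ids = [vocab[token] for token in word_tokens]

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens, word_ids)
print(len(subword_tokens), subword_tokens)
```

The character-level sequence is four times longer than the word-level one here, which is the trade-off the following sections discuss.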

Word-Level Tokenization

A Simple Approach

def word_tokenize(text):
    return text.split()

# Example
text = "Artificial intelligence is shaping the future"
tokens = word_tokenize(text)
# ['Artificial', 'intelligence', 'is', 'shaping', 'the', 'future']

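Problems

OOV (out of vocabulary): words that never appeared during training cannot be represented.
Huge vocabulary: managing hundreds of thousands of words is inefficient.
Morphological variety: in languages such as Turkish, suffixes produce an enormous number of word forms.
Compound terms: should "Artificial intelligence" be treated as one concept or split into two?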
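
The OOV problem is easy to reproduce. The following is a minimal sketch, assuming a vocabulary built from a single training sentence and a reserved `[UNK]` entry; the helper names `build_vocab` and `encode` are illustrative and not taken from any library.

```python
def build_vocab(corpus_tokens):
    # Reserve ID 0 for unknown words, similar in spirit to an [UNK] token.
    vocab = {"[UNK]": 0}
    for token in corpus_tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Any word missing from the vocabulary falls back to the [UNK] ID.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

train_tokens = "artificial intelligence is shaping the future".split()
vocab = build_vocab(train_tokens)

# "reshaping" never appeared during training, so it collapses to [UNK].
ids = encode("intelligence is reshaping the future".split(), vocab)
print(ids)
```

Once a word is mapped to `[UNK]`, the model can no longer tell it apart from any other unknown word, even though "reshaping" is closely related to a word it has seen.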

Character-Level Tokenization

def char_tokenize(text):
    return list(text)

# Example
text = "Hello"
tokens = char_tokenize(text)
# ['H', 'e', 'l', 'l', 'o']

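Advantages

No OOV problem.
Small vocabulary (on the order of 100 characters for English text).

Disadvantages

Sequences become very long.
Individual tokens carry little contextual meaning.
The model's computational cost increases.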
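
The length penalty can be measured directly. This small sketch compares the two tokenizations on the example sentence used earlier; the remark about attention cost assumes a standard Transformer, where self-attention scales roughly with the square of the sequence length.

```python
text = "Artificial intelligence is shaping the future"

word_tokens = text.split()
char_tokens = list(text)

# In a standard Transformer, self-attention cost grows roughly quadratically
# with sequence length, so the longer sequence is disproportionately expensive.
ratio = len(char_tokens) / len(word_tokens)
print(len(word_tokens), len(char_tokens), round(ratio, 1))
```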

Subword Tokenization

The approach used by modern LLMs: a balance between word-level and character-level tokenization.

"tokenization" → ["token", "ization"]
"unhappiness" → ["un", "happiness"] or ["un", "happi", "ness"]

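BPE (Byte Pair Encoding)

One of the most widely used subword tokenization algorithms.

The BPE Algorithm

Split the text into individual characters.
Find the most frequent pair of adjacent symbols.
Merge that pair into a single token.
Repeat until the target vocabulary size is reached.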

BPE Example

Starting vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd']
Corpus: "low lower newest lowest widest"

Step 1: Most frequent pair 'e' + 's' → 'es'
Step 2: Most frequent pair 'es' + 't' → 'est'
Step 3: Most frequent pair 'l' + 'o' → 'lo'
Step 4: Most frequent pair 'lo' + 'w' → 'low'
...

Final Result: ['low', 'est', 'er', 'new', 'wid', ...]

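BPE Implementation

import re
from collections import Counter

def get_initial_vocab(corpus):
    # Map each word to its frequency, with its characters separated by spaces
    words = Counter(corpus.split())
    return {' '.join(word): freq for word, freq in words.items()}

def get_stats(vocab):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    # Match the pair only on whole-symbol boundaries; a plain str.replace
    # could also match the tail of one symbol and the head of the next
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    new_vocab = {}
    for word, freq in vocab.items():
        new_word = pattern.sub(lambda _: replacement, word)
        new_vocab[new_word] = freq
    return new_vocab

def train_bpe(corpus, num_merges):
    vocab = get_initial_vocab(corpus)

    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        # Ties are broken by insertion order of the pairs dictionary
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)

    return vocab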
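
Training is only half of BPE: to tokenize new text, the learned merges have to be replayed in the order they were learned. The sketch below shows one straightforward way to do that. The `apply_merges` helper is illustrative, and the merge list is copied from the four steps of the BPE example earlier in this section rather than produced by a real training run.

```python
def apply_merges(word, merges):
    """Segment a word by replaying learned merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever the two symbols sit next to each other.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merge list from the four example steps; a trained model would have many more.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(apply_merges("lowest", merges))
print(apply_merges("slowest", merges))
```

The second call shows why BPE avoids the OOV problem: "slowest" was never in the corpus, yet it is still covered by known pieces plus a single leftover character.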

WordPiece

An algorithm developed by Google and used in models such as BERT.

BPE vs. WordPiece

Feature	BPE	WordPiece
Merge Criterion	Frequency	Likelihood
Prefix	None	## (for word-internal tokens)
Used In	GPT, LLaMA	BERT, DistilBERT
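
The difference in merge criterion can be seen on a toy corpus. The sketch below assumes a commonly cited formulation of the WordPiece score, pair frequency divided by the product of the two symbol frequencies; Google's original training code is not public, so treat this as an approximation. The corpus and its counts are made up for illustration.

```python
from collections import Counter

# Toy corpus: words already split into symbols, with hypothetical frequencies.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

symbol_freq = Counter()
pair_freq = Counter()
for symbols, freq in corpus.items():
    for symbol in symbols:
        symbol_freq[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_freq[pair] += freq

# BPE criterion: raw pair frequency.
bpe_choice = max(pair_freq, key=pair_freq.get)

# WordPiece-style criterion: pair frequency normalised by the frequency of its parts.
def score(pair):
    return pair_freq[pair] / (symbol_freq[pair[0]] * symbol_freq[pair[1]])

wordpiece_choice = max(pair_freq, key=score)
print(bpe_choice, wordpiece_choice)
```

BPE picks the most frequent pair, while the normalised score prefers a pair whose parts rarely occur apart, even though that pair is less frequent overall.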

WordPiece Example

"tokenization" → ["token", "##ization"]
"playing" → ["play", "##ing"]

SentencePiece

A language-independent tokenizer developed by Google.

Features

Language independent: does not assume that whitespace separates words.
Raw text input: trains directly on raw sentences, with no pre-tokenization step.
BPE + Unigram: supports multiple algorithms.
Reversible: detokenization recovers the normalized input text.

SentencePiece Usage

```python
import sentencepiece as spm

# Training the model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

# Loading and using the model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

# Encode
tokens = sp.encode('Hello world', out_type=str)
# e.g. ['▁Hello', '▁world'] (depends on the trained model)

ids = sp.encode('Hello world', out_type=int)
# e.g. [1234, 5678] (one ID per piece; actual values depend on the trained model)

# Decode
text = sp.decode(ids)
# 'Hello world'
```
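
The reversibility property follows from how spaces are handled: they are kept inside the pieces as the ▁ marker instead of being thrown away. The sketch below imitates that idea in plain Python. It splits only at spaces and skips the library's normalization and learned vocabulary, so `to_pieces` and `from_pieces` are illustrative helpers, not the SentencePiece API.

```python
SPACE_MARK = "\u2581"  # the "▁" character (LOWER ONE EIGHTH BLOCK), not an ASCII underscore

def to_pieces(text):
    # Prefix the text with a space, then turn every space into the marker
    # and start a new piece at each marker.
    marked = (" " + text).replace(" ", SPACE_MARK)
    pieces = []
    for ch in marked:
        if ch == SPACE_MARK or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

def from_pieces(pieces):
    # Detokenization is concatenation, then undoing the marker and the added prefix.
    return "".join(pieces).replace(SPACE_MARK, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)
print(from_pieces(pieces))
```

Because the marker travels with the pieces, no language-specific rules are needed to decide where spaces go when decoding.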

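The ▁ Symbol

SentencePiece replaces each space with ▁ (U+2581, which looks like an underscore but is a different character), so the symbol appears at the start of each word: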

"Hello world" → ["▁Hello", "▁world"]
"New York" → ["▁New", "▁York"]

Tiktoken (OpenAI)

A fast BPE implementation used by OpenAI models.

import tiktoken

# Loading the encoder
enc = tiktoken.encoding_for_model("gpt-4")

# Encode
tokens = enc.encode("Hello world!")
# e.g. [12345, 67890, 999] (illustrative IDs)

# Decode
text = enc.decode(tokens)
# "Hello world!"

# Check token count
print(len(tokens))  # 3

Model-Encoder Mappings

Model	Encoder	Vocab Size
GPT-4	cl100k_base	100,277
GPT-3.5	cl100k_base	100,277
GPT-3 (davinci)	r50k_base	50,257
Codex (code-davinci-002)	p50k_base	50,281

Hugging Face Tokenizers

from transformers import AutoTokenizer

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode
encoded = tokenizer("Hello, world!", return_tensors="pt")
# {
#   'input_ids': tensor([[101, 7592, 1010, 2088, 999, 102]]),
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
# }

# Decode
text = tokenizer.decode(encoded['input_ids'][0])
# "[CLS] hello, world! [SEP]"

# Token List
tokens = tokenizer.tokenize("Hello, world!")
# ['hello', ',', 'world', '!']

Fast Tokenizers

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Creating a new tokenizer; unk_token tells the model which token to emit
# for symbols it has never seen
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")

Special Tokens

Common Special Tokens

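Token	Description	Use Case
[CLS]	Start of sequence	BERT classification tasks
[SEP]	Segment separator	Separating sentence pairs
[PAD]	Padding	Aligning sequences in a batch
[UNK]	Unknown token	Handling out-of-vocabulary words
[MASK]	Mask	Masked Language Modeling (MLM)
<|endoftext|>	End of sequence	GPT generation tasks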
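
To show how `[CLS]`, `[SEP]`, and `[PAD]` work together in a batch, here is a minimal sketch that wraps token lists BERT-style and pads them to a common length. The `build_inputs` helper is hypothetical; in practice a library tokenizer does this when asked to pad.

```python
def build_inputs(sentences, pad_token="[PAD]"):
    """Wrap each token list in [CLS] ... [SEP] and pad to the longest sequence."""
    wrapped = [["[CLS]"] + tokens + ["[SEP]"] for tokens in sentences]
    max_len = max(len(seq) for seq in wrapped)
    batch, masks = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        batch.append(seq + [pad_token] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        masks.append([1] * len(seq) + [0] * n_pad)
    return batch, masks

batch, masks = build_inputs([["hello", ",", "world", "!"], ["hi"]])
for seq, mask in zip(batch, masks):
    print(seq, mask)
```

The attention mask is what keeps the padding from influencing the model: padded positions are present in the tensor but excluded from attention.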

Chat Tokens

<|system|>You are a helpful assistant<|end|>
<|user|>Hello!<|end|>
<|assistant|>Hello! How can I help you today?<|end|>
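
A prompt in this layout can be assembled from a list of messages. The sketch below assumes the `<|role|>...<|end|>` markers shown above; actual marker names differ between model families, so a real application should rely on the template shipped with the model. `format_chat` is an illustrative helper.

```python
def format_chat(messages):
    """Render messages using the <|role|>...<|end|> layout."""
    return "\n".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
]

# Leave the assistant turn open so the model generates its content.
prompt = format_chat(messages) + "\n<|assistant|>"
print(prompt)
```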

Tokenization Challenges in Turkish

Morphological Richness

"gelebileceklermiş" (they were said to be able to come) → a single word with a complex internal structure
gel (come) + ebil (can) + ecek (will) + ler (they) + miş (reportedly)

Tokenization:
- Poor: ["gelebileceklermiş"] (a single, very rare token)
- Good: ["gel", "ebil", "ecek", "ler", "miş"]

Solutions

Turkish-optimized tokenizer training.
Integration of morphological analysis.
Suffix-aware BPE application.
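
As a rough illustration of suffix-aware pre-segmentation, the sketch below peels known suffixes off the end of a word before any BPE step would run. The suffix list is hand-picked for the example word and ignores vowel harmony and ambiguity; it is a toy stand-in for a real morphological analyzer, and `split_suffixes` is a hypothetical helper.

```python
# Hypothetical, hand-picked suffix list; a real system would use a morphological analyzer.
SUFFIXES = ["ebil", "ecek", "ler", "miş"]

def split_suffixes(word, suffixes=SUFFIXES):
    """Peel known suffixes off the end of a word, longest match first."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(suffixes, key=len, reverse=True):
            # Keep at least one character as the stem.
            if word.endswith(suffix) and len(word) > len(suffix):
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + parts

print(split_suffixes("gelebileceklermiş"))
```

Feeding such morpheme boundaries to the tokenizer trainer is one way to keep merges from crossing them, which tends to yield pieces that are reused across many word forms.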

Token Limits and Management

Context Windows

Model	Context length (tokens)	Approx. words
GPT-3.5	16K	approx. 12,000
GPT-4 Turbo	128K	approx. 96,000
Claude 3	200K	approx. 150,000
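
The word counts in the table follow from a rule of thumb of about 0.75 English words per token. The sketch below reproduces them; the ratio is an assumption that varies with language and tokenizer, not a property of any model.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; an assumption, not a constant

def approx_words(context_tokens):
    return int(context_tokens * WORDS_PER_TOKEN)

for tokens in (16_000, 128_000, 200_000):
    print(tokens, approx_words(tokens))
```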

Estimating Token Counts

import tiktoken

def estimate_tokens(text, chars_per_token=4):
    # Rough estimate: 1 token ≈ 4 characters (English)
    # For Turkish: 1 token ≈ 3 characters, so pass chars_per_token=3
    return len(text) // chars_per_token

# More accurate calculation
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
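
An estimate like this is typically used to check whether a prompt fits the context window while leaving room for the reply. The sketch below uses the character-based rule of thumb; the helper names, the default of 512 reserved output tokens, and the 16,000-token limit are illustrative assumptions.

```python
import math

def fits_context(prompt, context_limit, reserved_for_output=512, chars_per_token=4):
    """Check whether a prompt leaves room for the reply, using a character-based estimate."""
    estimated_prompt_tokens = math.ceil(len(prompt) / chars_per_token)
    return estimated_prompt_tokens + reserved_for_output <= context_limit

def truncate_to_budget(prompt, context_limit, reserved_for_output=512, chars_per_token=4):
    """Cut the prompt so that its estimated size fits the remaining budget."""
    max_chars = max(context_limit - reserved_for_output, 0) * chars_per_token
    return prompt[:max_chars]

doc = "word " * 20_000  # 100,000 characters, roughly 25,000 estimated tokens
print(fits_context(doc, context_limit=16_000))

short = truncate_to_budget(doc, context_limit=16_000)
print(len(short), fits_context(short, context_limit=16_000))
```

For anything close to the limit, an exact count from the model's own tokenizer is the safer choice, since the character ratio is only an average.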

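Conclusion

Tokenization is a fundamental building block of NLP and LLMs. Subword methods such as BPE, WordPiece, and SentencePiece play a key role in the success of modern language models. Choosing and configuring the right tokenizer directly affects the model's final performance.

At Veni AI, we offer tokenization strategies tailored to Turkish NLP solutions.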
