Field	Value	Source
Canonical Path	/blog/llm-quantization-model-optimizasyonu-int8-int4	Veni AI Blog
Primary Category	Model Optimization	Post Metadata
Author	Veni AI Technical Team	Post Metadata

LLM Quantization and Model Optimization: INT8, INT4, and GPTQ

Quantization converts model weights and activations to lower-precision numeric formats. This substantially reduces memory usage and can shorten inference time.

Quantization basics

Why quantize?

Metric	FP32	FP16	INT8	INT4
Bits/parameter	32	16	8	4
7B model size	28 GB	14 GB	7 GB	3.5 GB
Relative speed	1x	1.5-2x	2-4x	3-5x
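The model sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. The sketch below reproduces them, assuming exactly 7 billion parameters and 1 GB = 10^9 bytes, and ignoring scale factors, activations, and the KV cache. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a "7B" model has exactly 7e9 parameters.
sizes = {
    name: weight_memory_gb(7e9, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in sizes.items():
    print(f"{name}: {gb:.1f} GB")
```

Real checkpoints are somewhat larger, because quantized formats also store scale factors and usually keep some layers in higher precision.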

Numeric formats

FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits
FP16: 1 sign bit + 5 exponent bits + 10 mantissa bits
BF16: 1 sign bit + 8 exponent bits + 7 mantissa bits
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
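The bit layouts above determine each format's range. As a rough illustration, the following sketch derives the largest finite value of each floating-point format and the range of each signed integer format from the bit counts alone. It assumes IEEE-754-style encoding (all-ones exponent reserved for infinity and NaN); the function names are illustrative.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    print(f"{name}: max finite value ~ {max_finite(e, m):.4g}")

for bits in (8, 4):
    print(f"INT{bits}: range {int_range(bits)}")
```

BF16 shares FP32's exponent width, so it covers nearly the same range with fewer mantissa bits, while FP16 tops out at 65504.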

Quantization types

Post-Training Quantization (PTQ)

Quantization applied after training:

import torch

def quantize_tensor(tensor, bits=8):
    """Asymmetric min-max quantization to unsigned integers (bits <= 8)."""
    qmax = 2**bits - 1
    min_val = tensor.min()
    max_val = tensor.max()

    # Scale maps the float range onto [0, qmax]; guard against a constant tensor
    scale = torch.clamp((max_val - min_val) / qmax, min=1e-8)
    zero_point = torch.round(-min_val / scale)

    # Quantize and clamp to the representable range
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, qmax)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    # Map the integers back to approximate float values
    return (q_tensor.float() - zero_point) * scale
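To make the cost of min-max quantization concrete, here is a dependency-free sketch of the same scale and zero-point scheme on a plain Python list. It is an illustration with made-up weights, not a replacement for the tensor version. It shows that the round-trip error stays within half a quantization step, and that the step grows as the bit width shrinks.

```python
def quantize(values, bits=8):
    """Min-max (asymmetric) quantization of a list of floats."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale) + zero_point, 0), levels) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]


# Hypothetical weights chosen for illustration.
weights = [-1.0, -0.27, 0.0, 0.3, 0.5]

results = {}
for bits in (8, 4):
    q, scale, zero_point = quantize(weights, bits)
    restored = dequantize(q, scale, zero_point)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    results[bits] = (scale, max_err)
    print(f"{bits}-bit: step={scale:.4f}, max round-trip error={max_err:.4f}")
```

Going from 8 to 4 bits makes the step 17 times larger (255 levels versus 15), which is why 4-bit methods rely on finer-grained scales or error compensation.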

Quantization-Aware Training (QAT)

Simulating quantization during training:

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(tensor, bits):
    # Symmetric per-tensor fake quantization
    qmax = 2**(bits - 1) - 1
    scale = tensor.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(tensor / scale), -qmax - 1, qmax)
    # Straight-through estimator: the forward pass uses the quantized value,
    # the backward pass treats rounding as the identity
    return tensor + (q * scale - tensor).detach()

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)
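Rounding has a zero gradient almost everywhere, so a straight-through estimator is commonly written with a detached residual. The sketch below illustrates that idiom and is not taken from the article's training code. It compares the gradient that reaches the weights with and without the residual, using three made-up weight values.

```python
import torch

w = torch.tensor([0.12, -0.57, 0.91], requires_grad=True)
scale = w.detach().abs().max() / 127  # symmetric 8-bit scale

# Naive fake quantization: round() blocks the gradient.
naive = torch.round(w / scale) * scale
naive.sum().backward()
naive_grad = w.grad.clone()

w.grad = None

# Straight-through estimator: same forward value, identity backward pass.
ste = w + (torch.round(w / scale) * scale - w).detach()
ste.sum().backward()
ste_grad = w.grad.clone()

print("naive gradient:", naive_grad)
print("STE gradient:  ", ste_grad)
```

With the naive version the weights receive no learning signal at all; with the detached residual they receive the gradient as if no rounding had happened.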

GPTQ (Accurate Post-Training Quantization)

Layer-by-layer quantization that minimizes each layer's reconstruction error:

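from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,                     # INT4
    group_size=128,             # Group quantization
    desc_act=False,             # Disable activation order
    damp_percent=0.1            # Hessian dampening factor
)

# Load the full-precision model for quantization
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration data: a list of tokenized samples
# (one placeholder sentence here; use a few hundred representative texts in practice)
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_data = [tokenizer("Quantization reduces the memory footprint of large language models.")]

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")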

How GPTQ works

1. For each layer:
   a. Compute the Hessian of the layer's output error from calibration inputs (it measures how sensitive the output is to each weight)
   b. Quantize the weights one column at a time
   c. Update the remaining unquantized weights to compensate for the error
   d. Move on to the next column

2. Group quantization:
   - Each group of 128 weights shares one scale factor
   - Better accuracy, slightly more memory
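The group quantization trade-off can be illustrated without GPTQ's Hessian-based error compensation. The sketch below uses plain round-to-nearest with symmetric 4-bit scales, on a made-up row of 256 weights containing one outlier, and compares a single shared scale against one scale per group of 128. The storage estimate assumes one FP16 scale per group, which is an assumption rather than a statement about any particular file format.

```python
def symmetric_quantize(values, bits=4):
    """Round-to-nearest with one shared scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_quantize(values, group_size, bits=4):
    """Quantize each consecutive group with its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(symmetric_quantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weight row: small values plus a single outlier in the second group.
weights = [0.01 * ((i % 7) - 3) for i in range(256)]
weights[200] = 2.0

per_tensor_err = mean_abs_error(weights, symmetric_quantize(weights))
per_group_err = mean_abs_error(weights, groupwise_quantize(weights, group_size=128))

# Assumed storage cost: 4-bit weights plus one 16-bit scale per 128 weights.
effective_bits = 4 + 16 / 128

print(f"one scale:       mean abs error {per_tensor_err:.5f}")
print(f"groups of 128:   mean abs error {per_group_err:.5f}")
print(f"effective bits per weight: {effective_bits}")
```

The outlier inflates the scale only for its own group, so the other group keeps a fine step size. The price is the extra scale metadata, here 0.125 bits per weight.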

AWQ (Activation-aware Weight Quantization)

Preserving important weights based on the activation distribution:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# calibration_samples: a list of calibration text strings (not defined here)
model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")

BitsAndBytes Quantization

Hugging Face integration:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

# Load with the 4-bit config; pass bnb_config_8bit instead for 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
```

llama.cpp and GGUF

A format optimized for CPU inference:

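# Model conversion (Hugging Face checkpoint to FP16 GGUF)
# Note: newer llama.cpp releases name these tools convert_hf_to_gguf.py and llama-quantize
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantization
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m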

GGUF quality levels

Format	Bits	Size (7B)	Quality
Q2_K	2.5	2.7 GB	Low
Q3_K_M	3.4	3.3 GB	Medium-low
Q4_K_M	4.5	4.1 GB	Medium
Q5_K_M	5.5	4.8 GB	Good
Q6_K	6.5	5.5 GB	Very good
Q8_0	8	7.2 GB	Best

Using GGUF with Python

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,       # Context window in tokens
    n_threads=8,      # CPU threads
    n_gpu_layers=35   # GPU offloading
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)

# The completion text is in the first choice
print(output["choices"][0]["text"])

Benchmark comparison

Performance metrics

Model: Llama-2-7B
Hardware: RTX 4090

| Method | Memory | Tokens/s | Perplexity |
|--------|--------|----------|------------|
| FP16   | 14 GB  | 45       | 5.47       |
| INT8   | 7 GB   | 82       | 5.49       |
| GPTQ-4 | 4 GB   | 125      | 5.63       |
| AWQ-4  | 4 GB   | 130      | 5.58       |
| GGUF Q4| 4 GB   | 95 (CPU) | 5.65       |
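The table is easier to interpret as ratios against the FP16 baseline. The following sketch computes throughput speedup and relative perplexity increase from the GPU rows above; the GGUF row is left out because its throughput was measured on CPU. The numbers are copied from the table, nothing is newly measured.

```python
# Values copied from the benchmark table (GPU rows only).
benchmarks = {
    "FP16":   {"tokens_per_s": 45,  "perplexity": 5.47},
    "INT8":   {"tokens_per_s": 82,  "perplexity": 5.49},
    "GPTQ-4": {"tokens_per_s": 125, "perplexity": 5.63},
    "AWQ-4":  {"tokens_per_s": 130, "perplexity": 5.58},
}

baseline = benchmarks["FP16"]
summary = {}
for method, row in benchmarks.items():
    speedup = row["tokens_per_s"] / baseline["tokens_per_s"]
    ppl_increase_pct = (row["perplexity"] / baseline["perplexity"] - 1) * 100
    summary[method] = (speedup, ppl_increase_pct)
    print(f"{method}: {speedup:.2f}x throughput, +{ppl_increase_pct:.1f}% perplexity")
```

On these figures the 4-bit methods deliver close to three times the FP16 throughput for a perplexity increase of roughly 2 to 3 percent.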

Inference optimization

Fast inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2  # Shards the model across 2 GPUs
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)

# One result per prompt; each holds its generated completions
print(outputs[0].outputs[0].text)

Flash Attention integration

import torch
from transformers import AutoModelForCausalLM

# Requires the flash-attn package and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

Selection criteria

Quantization selection matrix

Use Case → Recommended Method

Production API (GPU available):
  → GPTQ or AWQ (4-bit)

Edge/Mobile:
  → GGUF Q4_K_M

Fine-tuning required:
  → QLoRA (4-bit BitsAndBytes)

Maximum quality:
  → INT8 or FP16

Maximum speed:
  → AWQ + vLLM
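If the matrix needs to live in code, for example in a deployment script, it can be expressed as a simple lookup. This is a minimal sketch: the key names and the `recommend` helper are hypothetical, and only the recommendations themselves come from the matrix above.

```python
# Keys are hypothetical identifiers; values mirror the selection matrix.
RECOMMENDATIONS = {
    "production_api_gpu": "GPTQ or AWQ (4-bit)",
    "edge_mobile": "GGUF Q4_K_M",
    "fine_tuning": "QLoRA (4-bit BitsAndBytes)",
    "max_quality": "INT8 or FP16",
    "max_speed": "AWQ + vLLM",
}


def recommend(use_case: str) -> str:
    """Return the recommended quantization method for a known use case."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}") from None


print(recommend("edge_mobile"))
```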

Conclusion

Quantization is a key optimization technique that makes LLMs faster and more accessible. The right method depends on the use case and the hardware constraints.

At Veni AI, we offer consulting on model optimization.
