Field	Value	Source
Canonical Path	/blog/llm-quantization-model-optimizasyonu-int8-int4	Veni AI Blog
Primary Category	Model Optimization	Post Metadata
Author	Veni AI Technical Team	Post Metadata

LLM Quantization and Model Optimization: INT8, INT4, and GPTQ

Quantization converts a model's weights and activations to lower-precision numeric formats. Storing fewer bits per value significantly reduces memory use and inference time.

Quantization Fundamentals

Why Quantize?

Metric	FP32	FP16	INT8	INT4
Bits per parameter	32	16	8	4
7B model (weights only)	28 GB	14 GB	7 GB	3.5 GB
Relative speed	1x	1.5-2x	2-4x	3-5x
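
The memory row follows from parameter count times bits per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and counting weights only (activations, KV cache, and runtime overhead are ignored); the helper name `weight_memory_gb` is illustrative:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real 4-bit checkpoints come out slightly larger than this estimate because scales and zero points are stored alongside the weights.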

Numeric Formats

FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits
FP16: 1 sign bit + 5 exponent bits + 10 mantissa bits
BF16: 1 sign bit + 8 exponent bits + 7 mantissa bits
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
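
The bit layouts above determine each format's range and precision. The sketch below derives them, assuming two's-complement integers and IEEE-754-style floats whose top exponent code is reserved for infinity and NaN; the function names are illustrative:

```python
def int_range(bits):
    """Smallest and largest value of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def float_max(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-754-style binary float."""
    max_exponent = 2 ** (exponent_bits - 1) - 1  # top exponent code is inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def float_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    print(f"{name}: max={float_max(e, m):.4g}  epsilon={float_epsilon(m):.3g}")
print("INT8:", int_range(8), " INT4:", int_range(4))
```

BF16 keeps the 8 exponent bits of FP32, so it covers nearly the same range as FP32 with a coarser step, while FP16 has a finer step but a much smaller range.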

Types of Quantization

Post-Training Quantization (PTQ)

Quantization applied to an already-trained model, with no further training:

import torch

def quantize_tensor(tensor, bits=8):
    """Asymmetric min-max quantization to unsigned integers (bits <= 8)."""
    qmax = 2**bits - 1
    min_val = tensor.min()
    max_val = tensor.max()

    # Scale maps the float range onto [0, qmax]; guard against a constant tensor
    scale = torch.clamp((max_val - min_val) / qmax, min=1e-8)
    zero_point = torch.round(-min_val / scale)

    # Quantize, then clamp to the representable range
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, qmax)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    # Map the integers back to approximate float values
    return (q_tensor.float() - zero_point) * scale
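
To see what a quantize/dequantize round trip costs, the same min-max scheme can be traced by hand on a few values. This is a plain-Python sketch with no torch dependency; the names `quantize` and `dequantize` and the sample weights are made up for illustration:

```python
def quantize(values, bits=8):
    """Asymmetric min-max quantization of a list of floats to unsigned ints."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0:  # constant input: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, zero_point)
print(f"scale={scale:.5f}  max round-trip error={max_error:.5f}")
```

When no value is clamped, the round-trip error of each element is at most half a quantization step, so halving the number of bits roughly squares the size of the error relative to the range.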

Quantization-Aware Training (QAT)

Simulating quantization during training so the model learns to tolerate it:

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(tensor, bits):
    # Symmetric quantize-dequantize onto a signed integer grid
    scale = tensor.abs().max() / (2**(bits-1) - 1)
    q = torch.round(tensor / scale)
    q = torch.clamp(q, -2**(bits-1), 2**(bits-1) - 1)
    # Straight-through estimator: the forward pass uses the quantized value,
    # the backward pass treats rounding as the identity
    return tensor + (q * scale - tensor).detach()

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)
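
The straight-through estimator can be followed by hand on a single weight. The toy loop below is a sketch under simplifying assumptions (one scalar weight, a fixed grid step of 0.25, a squared-error loss, hand-picked learning rate); it is not how a full QAT pipeline is written, and `train_with_ste` is a made-up name:

```python
def fake_quantize(w, step):
    """Snap w to the nearest multiple of step (quantize, then dequantize)."""
    return round(w / step) * step


def train_with_ste(target, step=0.25, lr=0.1, iterations=200):
    w = 0.0  # latent full-precision weight
    for _ in range(iterations):
        w_q = fake_quantize(w, step)  # forward pass sees the quantized weight
        grad = 2 * (w_q - target)     # gradient of (w_q - target)**2 w.r.t. w_q
        w -= lr * grad                # STE: apply that gradient to w unchanged
    return w, fake_quantize(w, step)


latent, quantized = train_with_ste(target=0.6)
print(f"latent weight={latent:.3f}  quantized weight={quantized}")
```

Because the target 0.6 is not on the grid, the latent weight ends up hovering around the rounding boundary between the two grid points that bracket it (0.5 and 0.75), which is the usual behaviour of this estimator.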

GPTQ (Accurate Post-Training Quantization)

Layer-by-layer quantization that minimizes each layer's reconstruction error:

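from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,                     # INT4
    group_size=128,             # One scale per group of 128 weights
    desc_act=False,             # Disable activation-order quantization
    damp_percent=0.1            # Dampening factor
)

# Load the full-precision model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration data: tokenized text samples (one sentence here for brevity;
# use more, representative samples in practice)
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_data = [
    tokenizer("Quantization reduces the memory footprint of large language models.")
]

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")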

How GPTQ Works

1. For each layer:
   a. Compute the Hessian of the layer's output error from calibration inputs (it measures how sensitive the output is to each weight)
   b. Quantize one column of weights
   c. Update the remaining, not-yet-quantized weights (error compensation)
   d. Move to the next column

2. Group quantization:
   - Each group of 128 weights shares 1 scale factor
   - Better accuracy, slightly more memory
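
The group-quantization trade-off in step 2 can be checked numerically. This is a sketch on synthetic weights, using plain round-to-nearest rather than GPTQ's error compensation; it assumes symmetric 4-bit quantization with one FP16 scale per group and no zero point, and all function names are illustrative:

```python
import math


def quantize_symmetric(values, bits=4):
    """Quantize-dequantize a list with one shared scale (symmetric, signed)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def quantize_grouped(values, group_size=128, bits=4):
    """Same scheme, but every group of `group_size` weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_symmetric(values[start:start + group_size], bits))
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# 256 small synthetic weights with a single large outlier in the first group
weights = [0.1 * math.sin(i) for i in range(256)]
weights[5] = 8.0

per_tensor_error = mean_abs_error(weights, quantize_symmetric(weights))
grouped_error = mean_abs_error(weights, quantize_grouped(weights))

# Storage cost, assuming one FP16 scale per group of 128 weights
bits_per_weight = 4 + 16 / 128

print(f"per-tensor error={per_tensor_error:.4f}  grouped error={grouped_error:.4f}")
print(f"effective bits per weight={bits_per_weight}")
```

With a single scale, the outlier stretches the grid so far that every small weight rounds to zero. With groups, only the group containing the outlier pays that price, at a cost of 0.125 extra bits per weight under the stated assumption.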

AWQ (Activation-aware Weight Quantization)

Protecting the most important weights, identified from the distribution of the activations:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# calibration_samples: a list of representative text strings prepared by the caller
model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")
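
The idea behind activation-aware scaling can be shown on one row of weights. This is a toy illustration only: the numbers are hand-picked, the channel scale of 2.0 is chosen by hand rather than searched from calibration activations as the real method does, and the helper names are made up:

```python
def quantize_symmetric(values, bits=4):
    """Round-to-nearest quantize-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.14, 0.5, -0.3, 0.7]
activations = [10.0, 1.0, 1.0, 1.0]  # channel 0 carries the large activations
reference = dot(weights, activations)

# Plain round-to-nearest quantization of the weights
plain_error = abs(dot(quantize_symmetric(weights), activations) - reference)

# Scale the salient channel up before quantizing and scale its activation down,
# which leaves the exact product unchanged
channel_scales = [2.0, 1.0, 1.0, 1.0]  # hand-picked, not searched
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]
awq_error = abs(dot(quantize_symmetric(scaled_weights), scaled_activations) - reference)

print(f"plain error={plain_error:.3f}  scaled error={awq_error:.3f}")
```

The scaled weight stays below the largest weight in the row, so the quantization step does not change, while the rounding error on the salient channel is divided by its scale once the activation is scaled back down.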

BitsAndBytes Quantization

Hugging Face integration:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

# Pass either config; the 4-bit one is used here
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
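
Nested (double) quantization shrinks the storage spent on the per-block scale constants. The arithmetic below is a sketch that assumes the block sizes described in the QLoRA paper (64 weights per block with a 32-bit constant, then 8-bit constants grouped in blocks of 256 with a 32-bit second-level constant); treat those values as assumptions rather than a guarantee about a given library version:

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight when each block stores one scale constant."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size=64, scale_bits=8,
                               outer_block_size=256, outer_scale_bits=32):
    """Overhead when the per-block scales are themselves quantized in blocks."""
    first_level = scale_bits / block_size
    second_level = outer_scale_bits / (block_size * outer_block_size)
    return first_level + second_level


plain = scale_overhead_bits()
nested = double_quant_overhead_bits()
saved_gb_7b = (plain - nested) * 7e9 / 8 / 1e9

print(f"overhead: {plain:.3f} -> {nested:.3f} bits per weight")
print(f"saving on a 7B model: about {saved_gb_7b:.2f} GB")
```

Under these assumptions the saving is roughly a third of a gigabyte on a 7B model, which matters most when the model barely fits on the target GPU.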

llama.cpp and GGUF

A format optimized for CPU inference:

# Model conversion
# (newer llama.cpp releases rename these tools to convert_hf_to_gguf.py and llama-quantize)
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantization
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m

GGUF Quantization Levels

Format	Bits per weight	Size (7B)	Quality
Q2_K	2.5	2.7 GB	Low
Q3_K_M	3.4	3.3 GB	Medium-low
Q4_K_M	4.5	4.1 GB	Medium
Q5_K_M	5.5	4.8 GB	Good
Q6_K	6.5	5.5 GB	Very good
Q8_0	8	7.2 GB	Excellent

Using GGUF from Python

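from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,       # Context window in tokens
    n_threads=8,      # CPU threads
    n_gpu_layers=35   # GPU offloading
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)

# The result is a completion-style dict; the generated text is in the first choice
print(output["choices"][0]["text"])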

Performance Comparison

Performance Metrics

Model: Llama-2-7B
Hardware: RTX 4090

| Method  | Memory | Tokens/s | Perplexity |
|---------|--------|----------|------------|
| FP16    | 14GB   | 45       | 5.47       |
| INT8    | 7GB    | 82       | 5.49       |
| GPTQ-4  | 4GB    | 125      | 5.63       |
| AWQ-4   | 4GB    | 130      | 5.58       |
| GGUF Q4 | 4GB    | 95 (CPU) | 5.65       |
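
The figures are easier to compare once expressed relative to the FP16 baseline. The sketch below only re-derives ratios from the numbers in the table; the dictionary layout and the function name `relative_to_baseline` are illustrative:

```python
# (memory in GB, tokens per second, perplexity) copied from the table above
results = {
    "FP16":    (14, 45, 5.47),
    "INT8":    (7, 82, 5.49),
    "GPTQ-4":  (4, 125, 5.63),
    "AWQ-4":   (4, 130, 5.58),
    "GGUF Q4": (4, 95, 5.65),
}


def relative_to_baseline(results, baseline="FP16"):
    """Memory ratio, speedup, and perplexity increase versus the baseline row."""
    base_mem, base_tps, base_ppl = results[baseline]
    summary = {}
    for method, (memory_gb, tps, ppl) in results.items():
        summary[method] = {
            "memory_ratio": memory_gb / base_mem,
            "speedup": tps / base_tps,
            "ppl_increase_pct": 100 * (ppl - base_ppl) / base_ppl,
        }
    return summary


summary = relative_to_baseline(results)
for method, row in summary.items():
    print(f"{method:8s} {row['speedup']:.2f}x speed  "
          f"{row['memory_ratio']:.2f}x memory  "
          f"+{row['ppl_increase_pct']:.2f}% perplexity")
```

Read this way, the 4-bit GPU methods in the table run close to three times faster than FP16 for a perplexity increase of about 2 to 3 percent.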

Inference Optimization

Fast Inference with vLLM

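from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2   # Shards the model across 2 GPUs; use 1 on a single GPU
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)

# One result per prompt; print the first completion of each
for output in outputs:
    print(output.outputs[0].text)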

Flash Attention Integration

import torch
from transformers import AutoModelForCausalLM

# Requires the flash-attn package and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

Selection Criteria

Quantization Selection Matrix

Use Case → Recommended Method

Production API (GPU available):
  → GPTQ or AWQ (4-bit)

Edge/Mobile:
  → GGUF Q4_K_M

Fine-tuning required:
  → QLoRA (4-bit BitsAndBytes)

Maximum quality:
  → INT8 or FP16

Maximum speed:
  → AWQ + vLLM
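
The matrix can be kept as a small lookup table in deployment tooling. This is a sketch only: the key names and the `recommend` helper are hypothetical, and the values are copied from the matrix above:

```python
RECOMMENDATIONS = {
    "production_gpu": "GPTQ or AWQ (4-bit)",
    "edge_mobile": "GGUF Q4_K_M",
    "fine_tuning": "QLoRA (4-bit BitsAndBytes)",
    "max_quality": "INT8 or FP16",
    "max_speed": "AWQ + vLLM",
}


def recommend(use_case):
    """Return the recommended quantization method for a known use case."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


print(recommend("edge_mobile"))
```

Failing loudly on an unknown use case keeps a typo from silently falling back to a default method.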

Conclusion

Quantization is an essential optimization technique that makes LLMs more accessible and faster. The right method depends on the use case and on the hardware constraints.

At Veni AI, we provide consulting on model optimization.
