Field	Value	Source
Canonical Path	/blog/llm-quantization-model-optimizasyonu-int8-int4	Veni AI Blog
Primary Category	Model Optimization	Post Metadata
Author	Veni AI Technical Team	Post Metadata

LLM Quantization and Model Optimization: INT8, INT4, and GPTQ

Quantization converts a model's weights and activations to lower-precision numeric formats. This cuts memory usage substantially (INT4 weights take one eighth the space of FP32) and shortens inference time.

Quantization Fundamentals

Why Quantize?

Metric	FP32	FP16	INT8	INT4
Bits/Parameter	32	16	8	4
7B Model Size	28GB	14GB	7GB	3.5GB
Relative Speed	1x	1.5-2x	2-4x	3-5x
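
The model sizes in the table follow directly from the bit width: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch of that arithmetic, assuming weights only (no activations or KV cache) and 1 GB = 10^9 bytes; the function name `model_size_gb` is illustrative:

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

# Reproduce the 7B row of the table for each format.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
```

Real checkpoints come out slightly larger because quantized formats also store scale factors and usually keep a few layers in higher precision.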

Numeric Formats

FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits
FP16: 1 sign bit + 5 exponent bits + 10 mantissa bits
BF16: 1 sign bit + 8 exponent bits + 7 mantissa bits
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
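
To make the bit layouts concrete, the sketch below splits a number stored as FP32 into its sign, exponent, and mantissa fields and computes the signed integer ranges listed above. It uses only the Python standard library; the helper names `fp32_fields` and `int_range` are illustrative, not part of any library.

```python
import struct

def fp32_fields(value: float) -> tuple[int, int, int]:
    """Split a number, stored as IEEE 754 FP32, into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer with the given bit width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

print(fp32_fields(-1.5))           # sign, biased exponent, mantissa
print(int_range(8), int_range(4))  # INT8 and INT4 ranges
```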

Types of Quantization

Post-Training Quantization (PTQ)

Quantization applied after training, with no further weight updates:

import torch

def quantize_tensor(tensor, bits=8):
    """Asymmetric min-max quantization to unsigned integers (bits <= 8)."""
    qmax = 2**bits - 1
    min_val = tensor.min()
    max_val = tensor.max()

    # Scale maps the float range onto [0, qmax]; guard against a constant tensor
    scale = torch.clamp((max_val - min_val) / qmax, min=1e-8)
    zero_point = torch.round(-min_val / scale)

    # Quantize and clamp to the representable range
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, qmax)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    # Map the integers back to approximate float values
    return (q_tensor.float() - zero_point) * scale
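
To see what min-max quantization costs in accuracy, the round trip can be traced on a handful of values. The sketch below redoes the same arithmetic in plain Python (no torch) with illustrative names and made-up weights, and measures the largest reconstruction error, which should stay within half a quantization step. It assumes the input is not constant, so the scale is non-zero.

```python
def quantize(values, bits=8):
    """Min-max quantization of a list of floats to integers in [0, 2**bits - 1]."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax            # assumes hi > lo
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale + zero_point), 0), qmax) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.9, -0.25, 0.0, 0.4, 1.2]   # hypothetical example values
q, scale, zp = quantize(weights, bits=8)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("max error:", max_err, "half step:", scale / 2)
```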

Quantization-Aware Training (QAT)

Simulating quantization during training so the weights adapt to the rounding error:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)

def fake_quantize(tensor, bits):
    # Symmetric quantize-dequantize: values stay float but snap to the integer grid
    scale = tensor.abs().max() / (2**(bits-1) - 1)
    q = torch.round(tensor / scale)
    q = torch.clamp(q, -2**(bits-1), 2**(bits-1) - 1)
    # Straight-through estimator: the forward pass returns q * scale, while the
    # backward pass treats rounding as identity so gradients reach the weights
    return tensor + (q * scale - tensor).detach()
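
The straight-through estimator named in the code comment can be stated compactly. Rounding has a zero gradient almost everywhere, so training would stall if it were differentiated literally; the usual workaround, written here with the same scale and clamp range as the code above, is to use the quantized weight in the forward pass and pass the gradient through unchanged in the backward pass. The symbols $w$ (weight), $\hat{w}$ (fake-quantized weight), $s$ (scale), $b$ (bit width), and $\mathcal{L}$ (loss) are notation introduced for this sketch.

```latex
\hat{w} = s \cdot \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{w}{s}\right),\; -2^{b-1},\; 2^{b-1}-1\right),
\qquad
s = \frac{\max_i |w_i|}{2^{b-1}-1}

\frac{\partial \mathcal{L}}{\partial w} \;\approx\; \frac{\partial \mathcal{L}}{\partial \hat{w}}
```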

GPTQ (Accurate Post-Training Quantization)

Layer-wise quantization with optimal reconstruction:

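from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,                     # INT4
    group_size=128,             # One scale per group of 128 weights
    desc_act=False,             # Disable activation-order quantization
    damp_percent=0.1            # Dampening factor
)

# Load the full-precision model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration data: a list of tokenized samples. The single sentence below is
# a placeholder; use a larger, representative set in practice.
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_data = [
    tokenizer("Quantization reduces the memory footprint of language models.")
]

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")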

How GPTQ Works

1. For each layer:
   a. Compute the Hessian matrix (captures the importance of each weight)
   b. Quantize the weights of the current column
   c. Update the remaining weights (error compensation)
   d. Move on to the next column

2. Group quantization:
   - Each group of 128 weights → 1 scale factor
   - Better accuracy, slightly more memory
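
The benefit of group quantization can be shown on a toy example. The sketch below is not GPTQ itself (there is no Hessian and no error compensation); it applies plain round-to-nearest 4-bit quantization, once with a single scale for the whole row and once with one scale per group of 128 weights, and compares the mean error. The weight row and all function names are made up for illustration.

```python
def quantize_symmetric(values, bits=4):
    """Round-trip values through symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def quantize_grouped(values, group_size, bits=4):
    """Apply symmetric quantization independently to each consecutive group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_symmetric(values[start:start + group_size], bits))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy row: small weights plus one outlier that lands in the second group.
row = [0.01 * ((i % 7) - 3) for i in range(256)]
row[200] = 2.0

err_single = mean_abs_error(row, quantize_symmetric(row))
err_grouped = mean_abs_error(row, quantize_grouped(row, group_size=128))
print("one scale per row:  ", err_single)
print("one scale per group:", err_grouped)
```

With a single scale, the outlier stretches the grid so far that every small weight rounds to zero. With groups, only the group containing the outlier pays that price, at the cost of storing one extra scale per group.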

AWQ (Activation-aware Weight Quantization)

Protects the most important weights, identified from the distribution of the activations:

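from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = {
    "zero_point": True,      # Asymmetric quantization
    "q_group_size": 128,     # One scale per group of 128 weights
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM"
}

# calibration_samples must be defined by the caller (e.g. a list of text strings)
model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")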

BitsAndBytes Quantization

Integration with Hugging Face:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

# Load the model with weights quantized on the fly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
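
The nested (double) quantization option trades a little extra computation for lower storage overhead on the scale constants. The arithmetic below is a rough sketch that assumes the block sizes reported in the QLoRA paper (one scale per 64 weights, and second-level blocks of 256 scales); the post itself does not state these values, and the function names are illustrative.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight for storing one FP32 scale per block."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, scale_bits=8,
                               second_block_size=256, second_scale_bits=32):
    """Overhead when the per-block scales are themselves quantized to 8 bits."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level

plain = scale_overhead_bits()           # bits per weight without nesting
nested = double_quant_overhead_bits()   # bits per weight with nesting
saved_gb_7b = (plain - nested) * 7e9 / 8 / 1e9

print(f"overhead: {plain:.3f} -> {nested:.3f} bits per weight")
print(f"saving on a 7B model: about {saved_gb_7b:.2f} GB")
```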

llama.cpp and GGUF

A format optimized for CPU inference:

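# Script and binary names as in older llama.cpp checkouts; recent releases
# ship convert_hf_to_gguf.py and llama-quantize instead.

# Model conversion
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantization
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m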

GGUF Quantization Levels

Format	Bits	Size (7B)	Quality
Q2_K	2.5	2.7GB	Low
Q3_K_M	3.4	3.3GB	Medium-low
Q4_K_M	4.5	4.1GB	Medium
Q5_K_M	5.5	4.8GB	Good
Q6_K	6.5	5.5GB	Very good
Q8_0	8	7.2GB	Best
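
One practical way to read this table is to pick the highest-quality level that still fits in the available memory. The sketch below does that with the 7B sizes listed above. The `overhead_gb` allowance for the KV cache and runtime buffers is a hypothetical default, and the function name is illustrative; actual headroom depends on context length and backend.

```python
# (format, approximate size in GB for a 7B model), taken from the table above
# and ordered from smallest to largest.
GGUF_LEVELS_7B = [
    ("Q2_K", 2.7), ("Q3_K_M", 3.3), ("Q4_K_M", 4.1),
    ("Q5_K_M", 4.8), ("Q6_K", 5.5), ("Q8_0", 7.2),
]

def pick_gguf_level(memory_budget_gb, overhead_gb=1.0):
    """Return the largest level whose file fits the budget, or None.

    overhead_gb is an assumed allowance for the KV cache and buffers.
    """
    usable = memory_budget_gb - overhead_gb
    fitting = [name for name, size in GGUF_LEVELS_7B if size <= usable]
    return fitting[-1] if fitting else None

print(pick_gguf_level(6.0))
```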

Using GGUF with Python

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,       # Context window in tokens
    n_threads=8,      # CPU threads
    n_gpu_layers=35   # Number of layers offloaded to the GPU
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)

# The result is an OpenAI-style completion dict
print(output["choices"][0]["text"])

Benchmark Comparison

Performance Metrics

Model: Llama-2-7B
Hardware: RTX 4090

| Method  | Memory | Tokens/s | Perplexity |
|---------|--------|----------|------------|
| FP16    | 14GB   | 45       | 5.47       |
| INT8    | 7GB    | 82       | 5.49       |
| GPTQ-4  | 4GB    | 125      | 5.63       |
| AWQ-4   | 4GB    | 130      | 5.58       |
| GGUF Q4 | 4GB    | 95 (CPU) | 5.65       |

Inference Optimization

Fast Inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2   # Shard the model across 2 GPUs
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)

# One result per prompt; print the first completion
print(outputs[0].outputs[0].text)

Flash Attention Integration

import torch
from transformers import AutoModelForCausalLM

# Requires the flash-attn package and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

Selection Criteria

Quantization Selection Matrix

Use Case → Recommended Method

Production API (GPU available):
  → GPTQ or AWQ (4-bit)

Edge/Mobile:
  → GGUF Q4_K_M

Fine-tuning required:
  → QLoRA (4-bit BitsAndBytes)

Maximum quality:
  → INT8 or FP16

Maximum speed:
  → AWQ + vLLM

Conclusion

Quantization is a key optimization technique that makes LLMs faster and more accessible. The right method depends on the use case and the hardware constraints.

At Veni AI, we offer consulting on model optimization.
