Field	Value	Source
Canonical Path	/blog/llm-quantization-model-optimizasyonu-int8-int4	Veni AI Blog
Primary Category	Model Optimization	Post Metadata
Author	Veni AI Technical Team	Post Metadata

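LLM Quantization and Model Optimization: INT8, INT4, and GPTQ

Quantization converts a model's weights and activations to lower-precision number formats. This substantially reduces memory usage and speeds up inference.

Quantization Basics

Why Quantize?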

Metric	FP32	FP16	INT8	INT4
Bits per parameter	32	16	8	4
7B model size (weights only)	28GB	14GB	7GB	3.5GB
Relative speed (approximate)	1x	1.5-2x	2-4x	3-5x
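
The sizes in the table follow from the bit width: parameter count times bits per parameter, divided by 8 bits per byte. A minimal sketch of that arithmetic, assuming decimal gigabytes (10^9 bytes) and counting weights only; the function name `model_size_gb` is illustrative, and activations and the KV cache are ignored:

```python
def model_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Reproduce the "7B model size" row for each format
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
```

Real checkpoints are usually a little larger than this estimate because quantized formats also store scales and other metadata.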

Number Formats

FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits
FP16: 1 sign bit + 5 exponent bits + 10 mantissa bits
BF16: 1 sign bit + 8 exponent bits + 7 mantissa bits
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
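
The bit layouts above determine each format's range. The sketch below derives the largest finite value of a float format from its exponent and mantissa widths, and the range of a signed integer from its bit width. It assumes IEEE-754-style encoding (all-ones exponent reserved for infinity and NaN); the helper names are illustrative.

```python
def float_max(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias

def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer with the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

print("FP16 max:", float_max(5, 10))
print("BF16 max:", float_max(8, 7))
print("INT8 range:", int_range(8))
print("INT4 range:", int_range(4))
```

BF16 shares FP32's 8 exponent bits, so it covers roughly the same range with less precision, while FP16 trades range for extra mantissa bits.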

Quantization Types

Post-Training Quantization (PTQ)

Quantization is applied after training is complete:

import torch

def quantize_tensor(tensor, bits=8):
    """Asymmetric (affine) min-max quantization to unsigned integers."""
    assert bits <= 8, "the result is stored as uint8"
    qmax = 2**bits - 1
    min_val, max_val = tensor.min(), tensor.max()

    # Scale maps the float range onto [0, qmax]; guard against a constant tensor
    scale = torch.clamp((max_val - min_val) / qmax, min=1e-8)
    zero_point = torch.round(-min_val / scale)

    # Quantize, then clamp to the representable range
    q_tensor = torch.clamp(torch.round(tensor / scale + zero_point), 0, qmax)
    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    # Inverse mapping back to floating point (an approximation of the input)
    return (q_tensor.float() - zero_point) * scale
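
To get a feel for the accuracy cost of min-max quantization, it helps to round-trip a handful of values and look at the error. The sketch below is a pure-Python restatement of the same affine scheme; the helper names and sample weights are made up for illustration. For in-range values the reconstruction error should stay within half a quantization step.

```python
def quantize(values, bits=8):
    """Affine min-max quantization of a list of floats to [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant input
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale + zero_point), 0), qmax) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # illustrative values
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"max error={max_error:.5f}")
```

Halving the bit width roughly squares the coarseness of the grid (256 levels become 16), which is why 4-bit methods rely on extra techniques such as grouping or error compensation.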

Quantization-Aware Training (QAT)

Quantization is simulated during training:

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(tensor, bits):
    # Symmetric per-tensor quantize-dequantize
    qmax = 2**(bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(tensor / scale), -qmax - 1, qmax)
    # Straight-through estimator: the forward pass sees the quantized value,
    # the backward pass treats rounding as identity so gradients reach the weights
    return tensor + (q * scale - tensor).detach()

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)
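
The straight-through estimator named in the code comment can be written out explicitly. Assuming symmetric per-tensor quantization with bit width b, as in `fake_quantize`, the forward pass uses the quantized weight while the backward pass treats rounding as the identity. The symbols below are notation introduced here for illustration:

```latex
s = \frac{\max_i |w_i|}{2^{b-1} - 1}, \qquad
\hat{w} = s \cdot \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w}{s}\right),\; -2^{b-1},\; 2^{b-1} - 1\right)

\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat{w}}
```

Without this approximation the gradient of the rounding step is zero almost everywhere, so the underlying full-precision weights would never be updated.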

GPTQ (Accurate Post-Training Quantization)

Layer-by-layer quantization that minimizes each layer's reconstruction error:

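from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,                     # INT4
    group_size=128,             # One scale per group of 128 weights
    desc_act=False,             # Disable activation-order quantization
    damp_percent=0.1            # Hessian dampening factor
)

# Calibration data: tokenized text samples (placeholder text shown here;
# use a larger set of representative samples in practice)
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_data = [
    tokenizer("Quantization reduces the memory footprint of large language models.")
]

# Load the full-precision model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")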

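How GPTQ Works

1. For each layer:
   a. Compute the Hessian of the layer's reconstruction error from calibration inputs (it measures how sensitive the output is to each weight)
   b. Quantize one column of weights
   c. Update the remaining unquantized weights to compensate for the error
   d. Move on to the next column

2. Group quantization:
   - Each group of 128 weights shares one scale factor
   - Better accuracy at the cost of slightly more memory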
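
The group-quantization trade-off can be illustrated without any GPTQ machinery. The sketch below uses plain round-to-nearest quantization (not GPTQ's Hessian-based error compensation) on synthetic weights with one outlier, and compares a single shared scale against one scale per group of 128. The function names, the synthetic data, and the 16-bit scale size are assumptions made for illustration.

```python
def quantize_symmetric(values, bits=4):
    """Quantize-dequantize a list of floats with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def quantize_grouped(values, bits=4, group_size=128):
    """Same scheme, but each group of `group_size` weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_symmetric(values[start:start + group_size], bits))
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Bits per weight once the per-group scale is included (FP16 scale assumed)."""
    return bits + scale_bits / group_size

# Synthetic weights: small values plus one large outlier in the first group
weights = [0.01 * ((i % 7) - 3) for i in range(256)]
weights[0] = 5.0

per_tensor_error = mean_abs_error(weights, quantize_symmetric(weights))
grouped_error = mean_abs_error(weights, quantize_grouped(weights))
print(f"per-tensor error: {per_tensor_error:.5f}")
print(f"grouped error:    {grouped_error:.5f}")
print(f"effective bits:   {effective_bits()}")
```

With a single scale, the outlier stretches the grid so far that every small weight rounds to zero. With groups, only the outlier's own group pays that price, and the extra storage is a fraction of a bit per weight.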

AWQ（Activation-aware Weight Quantization）

Protects the most important weights, identified from activation statistics:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# calibration_samples: a list of text strings that you supply
model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")

BitsAndBytes Quantization

Hugging Face integration:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
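
The `bnb_4bit_use_double_quant=True` option quantizes the per-block scaling constants themselves. A rough sketch of the storage arithmetic, following the description in the QLoRA paper; the block sizes (64 weights per block, 256 constants per second-level block) and the function name are assumptions for illustration, not values read from the bitsandbytes source:

```python
def bits_per_param(weight_bits=4, block_size=64, absmax_bits=32,
                   double_quant=False, dq_block_size=256, dq_absmax_bits=8):
    """Approximate storage per parameter for block-wise 4-bit quantization."""
    if not double_quant:
        # One full-precision absmax constant per block of weights
        return weight_bits + absmax_bits / block_size
    # Absmax constants are themselves quantized to 8 bits, with one
    # 32-bit constant per group of dq_block_size first-level constants
    overhead = dq_absmax_bits / block_size + absmax_bits / (block_size * dq_block_size)
    return weight_bits + overhead

plain = bits_per_param()
nested = bits_per_param(double_quant=True)
print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saved on a 7B model:  {(plain - nested) * 7e9 / 8 / 1e9:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, which is a few hundred megabytes on a 7B model.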

llama.cpp and GGUF

A format optimized for CPU inference:

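# Model conversion (recent llama.cpp releases ship this script as convert_hf_to_gguf.py)
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantization (recent llama.cpp releases name this binary llama-quantize)
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m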

GGUF Quantization Levels

Format	Bits per weight (approx.)	Size (7B)	Quality
Q2_K	2.5	2.7GB	Low
Q3_K_M	3.4	3.3GB	Medium-low
Q4_K_M	4.5	4.1GB	Medium
Q5_K_M	5.5	4.8GB	Good
Q6_K	6.5	5.5GB	Very good
Q8_0	8	7.2GB	Best

Using GGUF in Python

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35  # Number of layers to offload to the GPU (0 = CPU only)
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)
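
The call above returns a completion dictionary in the OpenAI-style layout rather than a plain string. A small sketch of pulling the generated text out of it; the helper name and the hand-written sample dictionary are hypothetical, so the code runs without a model file:

```python
def completion_text(output: dict) -> str:
    """Return the generated text of the first choice in a completion dict."""
    return output["choices"][0]["text"]

# Hand-written sample shaped like a completion response (hypothetical values)
sample_output = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": " Artificial intelligence is ...", "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 6, "completion_tokens": 256, "total_tokens": 262},
}

print(completion_text(sample_output).strip())
print("generated tokens:", sample_output["usage"]["completion_tokens"])
```

With a real model, the same helper would be applied to the `output` variable from the previous block.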

Benchmark Comparison

Performance Metrics

Model: Llama-2-7B
Hardware: RTX 4090

| Method | Memory | Tokens/s | Perplexity |
|--------|--------|----------|------------|
| FP16   | 14GB   | 45       | 5.47       |
| INT8   | 7GB    | 82       | 5.49       |
| GPTQ-4 | 4GB    | 125      | 5.63       |
| AWQ-4  | 4GB    | 130      | 5.58       |
| GGUF Q4| 4GB    | 95 (CPU) | 5.65       |
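
The trade-off in the table is easier to read as speedup and perplexity change relative to FP16. The sketch below does that arithmetic on the GPU rows copied from the table; the GGUF row is left out because it was measured on CPU, and the function name is illustrative.

```python
# Figures copied from the benchmark table above (FP16 is the baseline)
results = {
    "FP16":   {"tokens_per_s": 45,  "perplexity": 5.47},
    "INT8":   {"tokens_per_s": 82,  "perplexity": 5.49},
    "GPTQ-4": {"tokens_per_s": 125, "perplexity": 5.63},
    "AWQ-4":  {"tokens_per_s": 130, "perplexity": 5.58},
}

def relative_to_baseline(results, baseline="FP16"):
    """Return {method: (speedup, perplexity increase in percent)}."""
    base = results[baseline]
    summary = {}
    for method, r in results.items():
        speedup = r["tokens_per_s"] / base["tokens_per_s"]
        ppl_increase = (r["perplexity"] - base["perplexity"]) / base["perplexity"] * 100
        summary[method] = (round(speedup, 2), round(ppl_increase, 1))
    return summary

for method, (speedup, ppl) in relative_to_baseline(results).items():
    print(f"{method}: {speedup}x speed, +{ppl}% perplexity")
```

On these numbers the 4-bit methods run close to three times faster than FP16 while perplexity rises by about 2 to 3 percent.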

Inference Optimization

Fast Inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2  # Shard across 2 GPUs; use 1 on a single GPU
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)
print(outputs[0].outputs[0].text)

Flash Attention Integration

import torch
from transformers import AutoModelForCausalLM

# Requires the flash-attn package and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

Selection Criteria

Quantization Selection Matrix

Use Case → Recommended Method

Production API (GPU available):
  → GPTQ or AWQ (4-bit)

Edge/Mobile:
  → GGUF Q4_K_M

Fine-tuning required:
  → QLoRA (4-bit BitsAndBytes)

Maximum quality:
  → INT8 or FP16

Maximum speed:
  → AWQ + vLLM
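
If the matrix needs to live in code, for example in a deployment script, it can be expressed as a simple lookup. This is only a sketch: the key names and the `recommend` function are hypothetical, and the values are transcribed from the matrix above.

```python
# Recommendations transcribed from the selection matrix above
RECOMMENDATIONS = {
    "production_api_gpu": "GPTQ or AWQ (4-bit)",
    "edge_mobile": "GGUF Q4_K_M",
    "fine_tuning": "QLoRA (4-bit BitsAndBytes)",
    "max_quality": "INT8 or FP16",
    "max_speed": "AWQ + vLLM",
}

def recommend(use_case: str) -> str:
    """Look up the suggested quantization method for a use case key."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        options = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {options}") from None

print(recommend("edge_mobile"))
```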

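Conclusion

Quantization is a key optimization technique that makes LLMs more accessible and more efficient. The right method depends on the use case and the hardware constraints.

At Veni AI, we offer consulting services for model optimization.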
