Field	Value	Source
Canonical Path	/blog/llm-quantization-model-optimizasyonu-int8-int4	Veni AI Blog
Primary Category	Optimización de modelos	Post Metadata
Author	Veni AI Technical Team	Post Metadata

Cuantización y Optimización de Modelos LLM: INT8, INT4 y GPTQ

La cuantización es el proceso de convertir los pesos y activaciones del modelo a formatos numéricos de menor precisión. Este proceso reduce significativamente el uso de memoria y el tiempo de inferencia.

Fundamentos de la Cuantización

¿Por qué Cuantizar?

Métrica	FP32	FP16	INT8	INT4
Bits/Parámetro	32	16	8	4
Tamaño del Modelo 7B	28GB	14GB	7GB	3.5GB
Velocidad Relativa	1x	1.5-2x	2-4x	3-5x

Formatos Numéricos

1FP32: 1 bit sign + 8 bit exponent + 23 bit mantissa
2FP16: 1 bit sign + 5 bit exponent + 10 bit mantissa
3BF16: 1 bit sign + 8 bit exponent + 7 bit mantissa
4INT8: 8 bit integer (-128 to 127)
5INT4: 4 bit integer (-8 to 7)

Tipos de Cuantización

Cuantización Posterior al Entrenamiento (PTQ)

Cuantización después del entrenamiento:

1import torch
2
3def quantize_tensor(tensor, bits=8):
4    # Min-max scaling
5    min_val = tensor.min()
6    max_val = tensor.max()
7    
8    # Calculate scale and zero point
9    scale = (max_val - min_val) / (2**bits - 1)
10    zero_point = round(-min_val / scale)
11    
12    # Quantize
13    q_tensor = torch.round(tensor / scale + zero_point)
14    q_tensor = torch.clamp(q_tensor, 0, 2**bits - 1)
15    
16    return q_tensor.to(torch.uint8), scale, zero_point
17
18def dequantize_tensor(q_tensor, scale, zero_point):
19    return (q_tensor.float() - zero_point) * scale

Entrenamiento Consciente de Cuantización (QAT)

Simulación de cuantización durante el entrenamiento:

1class QuantizedLinear(nn.Module):
2    def __init__(self, in_features, out_features, bits=8):
3        super().__init__()
4        self.weight = nn.Parameter(torch.randn(out_features, in_features))
5        self.bits = bits
6    
7    def forward(self, x):
8        # Fake quantization during training
9        q_weight = fake_quantize(self.weight, self.bits)
10        return F.linear(x, q_weight)
11
12def fake_quantize(tensor, bits):
13    scale = tensor.abs().max() / (2**(bits-1) - 1)
14    q = torch.round(tensor / scale)
15    q = torch.clamp(q, -2**(bits-1), 2**(bits-1) - 1)
16    return q * scale  # Straight-through estimator

GPTQ (Cuantización Precisa Posterior al Entrenamiento)

Cuantización por capas con reconstrucción óptima:

1from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
2
3# Quantization config
4quantize_config = BaseQuantizeConfig(
5    bits=4,                     # INT4
6    group_size=128,             # Group quantization
7    desc_act=False,             # Disable activation order
8    damp_percent=0.1            # Dampening factor
9)
10
11# Model quantization
12model = AutoGPTQForCausalLM.from_pretrained(
13    "meta-llama/Llama-2-7b-hf",
14    quantize_config
15)
16
17# Quantize with calibration data
18model.quantize(calibration_data)
19
20# Save
21model.save_quantized("llama-2-7b-gptq")

Principio de Funcionamiento de GPTQ

11. For each layer:
2   a. Calculate Hessian matrix (determines weight importance)
3   b. Quantize least important weights
4   c. Update remaining weights (error compensation)
5   d. Move to next column
6
72. Group quantization:
8   - 128 weight groups → 1 scale factor
9   - Better accuracy, slightly more memory

AWQ (Cuantización de Pesos Consciente de Activaciones)

Preservación de pesos importantes basada en la distribución de activaciones:

1from awq import AutoAWQForCausalLM
2
3model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
4
5quant_config = {
6    "zero_point": True,
7    "q_group_size": 128,
8    "w_bit": 4,
9    "version": "GEMM"
10}
11
12model.quantize(
13    tokenizer=tokenizer,
14    quant_config=quant_config,
15    calib_data=calibration_samples
16)
17
18model.save_quantized("llama-2-7b-awq")
19## Cuantización BitsAndBytes
20
21Integración con Hugging Face:
22
23```python
24from transformers import AutoModelForCausalLM, BitsAndBytesConfig
25import torch
26
27# 8-bit quantization
28bnb_config_8bit = BitsAndBytesConfig(
29    load_in_8bit=True,
30    llm_int8_threshold=6.0,
31    llm_int8_has_fp16_weight=False
32)
33
34# 4-bit quantization (NF4)
35bnb_config_4bit = BitsAndBytesConfig(
36    load_in_4bit=True,
37    bnb_4bit_quant_type="nf4",  # or "fp4"
38    bnb_4bit_compute_dtype=torch.bfloat16,
39    bnb_4bit_use_double_quant=True  # Nested quantization
40)
41
42model = AutoModelForCausalLM.from_pretrained(
43    "meta-llama/Llama-2-7b-hf",
44    quantization_config=bnb_config_4bit,
45    device_map="auto"
46)

llama.cpp y GGUF

Formato optimizado para inferencia en CPU:

1# Model conversion
2python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16
3
4# Quantization
5./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m

Niveles de Cuantización GGUF

Formato	Bits	Tamaño (7B)	Calidad
Q2_K	2.5	2.7GB	Baja
Q3_K_M	3.4	3.3GB	Media-Baja
Q4_K_M	4.5	4.1GB	Media
Q5_K_M	5.5	4.8GB	Buena
Q6_K	6.5	5.5GB	Muy Buena
Q8_0	8	7.2GB	Excelente

Uso de GGUF con Python

1from llama_cpp import Llama
2
3llm = Llama(
4    model_path="llama-2-7b-q4_k_m.gguf",
5    n_ctx=4096,
6    n_threads=8,
7    n_gpu_layers=35  # GPU offloading
8)
9
10output = llm(
11    "What is artificial intelligence?",
12    max_tokens=256,
13    temperature=0.7
14)

Comparación de Benchmarks

Métricas de Rendimiento

1Model: Llama-2-7B
2Hardware: RTX 4090
3
4| Method | Memory | Tokens/s | Perplexity |
5|--------|--------|----------|------------|
6| FP16   | 14GB   | 45       | 5.47       |
7| INT8   | 7GB    | 82       | 5.49       |
8| GPTQ-4 | 4GB    | 125      | 5.63       |
9| AWQ-4  | 4GB    | 130      | 5.58       |
10| GGUF Q4| 4GB    | 95 (CPU) | 5.65       |

Optimización de Inferencia

Inferencia Rápida con vLLM

1from vllm import LLM, SamplingParams
2
3llm = LLM(
4    model="TheBloke/Llama-2-7B-GPTQ",
5    quantization="gptq",
6    tensor_parallel_size=2
7)
8
9sampling_params = SamplingParams(
10    temperature=0.8,
11    max_tokens=256
12)
13
14outputs = llm.generate(["Hello, "], sampling_params)

Integración con Flash Attention

1from transformers import AutoModelForCausalLM
2
3model = AutoModelForCausalLM.from_pretrained(
4    "meta-llama/Llama-2-7b-hf",
5    torch_dtype=torch.float16,
6    attn_implementation="flash_attention_2"
7)

Criterios de Selección

Matriz de Selección de Cuantización

1Use Case → Recommended Method
2
3Production API (GPU available):
4  → GPTQ or AWQ (4-bit)
5
6Edge/Mobile:
7  → GGUF Q4_K_M
8
9Fine-tuning required:
10  → QLoRA (4-bit BitsAndBytes)
11
12Maximum quality:
13  → INT8 or FP16
14
15Maximum speed:
16  → AWQ + vLLM

Conclusión

La cuantización es una técnica de optimización crítica que hace que los LLM sean más accesibles y rápidos. La elección del método adecuado depende del caso de uso y las limitaciones del hardware.

En Veni AI, ofrecemos consultoría en optimización de modelos.

Cuantización y optimización de modelos LLM: INT8, INT4 y GPTQ

Reference Overview