LLM Quantization and Model Optimization: INT8, INT4, and GPTQ
Quantization is the process of converting model weights (and optionally activations) to lower-precision numerical formats, significantly reducing memory usage and inference latency.
Quantization Fundamentals
Why Quantization?
| Metric | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Bits/Parameter | 32 | 16 | 8 | 4 |
| 7B Model Size | 28GB | 14GB | 7GB | 3.5GB |
| Relative Speed | 1x | 1.5-2x | 2-4x | 3-5x |
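The model-size column follows directly from parameter count times bits per parameter; a quick sanity check in Python:

```python
def model_size_gb(n_params: float, bits: int) -> float:
    """Raw weight storage in GB (using 1 GB = 1e9 bytes, matching the table)."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {model_size_gb(7e9, bits):.1f}GB")
# → 28.0GB, 14.0GB, 7.0GB, 3.5GB
```

Real checkpoints are slightly larger because of embeddings kept at higher precision, scale factors, and metadata.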
Number Formats
```
FP32: 1-bit sign + 8-bit exponent + 23-bit mantissa
FP16: 1-bit sign + 5-bit exponent + 10-bit mantissa
BF16: 1-bit sign + 8-bit exponent + 7-bit mantissa
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
```
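The exponent/mantissa split determines the largest representable value. A pure-Python check, assuming the standard IEEE-754-style layout that all three float formats above follow:

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style float:
    (2 - 2^-mantissa) * 2^(top exponent - bias)."""
    bias = 2**(exp_bits - 1) - 1
    max_exp = (2**exp_bits - 2) - bias  # the top exponent code is reserved for inf/NaN
    return (2 - 2**-mantissa_bits) * 2.0**max_exp

print(max_finite(5, 10))   # FP16 → 65504.0
print(max_finite(8, 7))    # BF16 ≈ 3.39e38
print(max_finite(8, 23))   # FP32 ≈ 3.40e38
```

This is why BF16 is popular for training: it keeps FP32's dynamic range (same 8-bit exponent) while halving storage, at the cost of mantissa precision.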
Quantization Types
Post-Training Quantization (PTQ)
Quantization after training:
```python
import torch

def quantize_tensor(tensor, bits=8):
    # Min-max (asymmetric) quantization to unsigned integers
    min_val = tensor.min()
    max_val = tensor.max()

    # Calculate scale and zero point
    scale = (max_val - min_val) / (2**bits - 1)
    zero_point = int(torch.round(-min_val / scale))

    # Quantize
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, 2**bits - 1)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    return (q_tensor.float() - zero_point) * scale
```
Quantization-Aware Training (QAT)
Quantization simulation during training:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)

def fake_quantize(tensor, bits):
    scale = tensor.abs().max() / (2**(bits - 1) - 1)
    q = torch.round(tensor / scale)
    q = torch.clamp(q, -2**(bits - 1), 2**(bits - 1) - 1)
    # Straight-through estimator: forward uses the quantized value,
    # backward treats quantization as the identity function
    return tensor + (q * scale - tensor).detach()
```
GPTQ (Accurate Post-Training Quantization)
Layer-wise quantization with optimal reconstruction:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,            # INT4
    group_size=128,    # Group quantization
    desc_act=False,    # Disable activation-order quantization
    damp_percent=0.1   # Hessian dampening factor
)

# Load the model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config
)

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")
```
GPTQ Working Principle
1. For each layer:
   - Compute the Hessian matrix (determines weight importance)
   - Quantize the least important weights first
   - Update the remaining weights to compensate for the quantization error
   - Move to the next column
2. Group quantization:
   - One scale factor per group of 128 weights
   - Better accuracy, slightly more memory
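The error-compensation step can be sketched in a few lines. This is a deliberately simplified illustration, not GPTQ itself: it quantizes one weight at a time and spreads the rounding error uniformly over the not-yet-quantized weights, where real GPTQ weights the update by the inverse Hessian.

```python
import numpy as np

def quantize_with_compensation(w, bits=4):
    """Greedy one-at-a-time quantization with error feedback
    (uniform error spread standing in for GPTQ's H^-1-weighted update)."""
    w = w.astype(np.float64).copy()
    qmax = 2**(bits - 1) - 1
    scale = np.abs(w).max() / qmax
    dq = np.empty_like(w)
    for j in range(len(w)):
        q = np.clip(round(w[j] / scale), -qmax - 1, qmax)
        dq[j] = q * scale
        err = w[j] - dq[j]
        remaining = len(w) - j - 1
        if remaining:
            # fold this weight's rounding error into the later weights
            w[j + 1:] += err / remaining
    return dq, scale
```

Even this crude version shows the key idea: later weights absorb earlier rounding errors, so the layer's output error stays smaller than with independent round-to-nearest.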
AWQ (Activation-aware Weight Quantization)
Preserving important weights based on activation distribution:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")
```
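The intuition behind AWQ can be shown standalone: input channels with large activations are scaled up before rounding so they lose less relative precision, and the inverse scale is folded back out (in a real deployment it is absorbed into the preceding layer). A simplified numpy sketch, not the library's actual algorithm:

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Plain round-to-nearest symmetric quantization, one scale per output row."""
    qmax = 2**(bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_style_quantize(W, act_magnitude, alpha=0.5, bits=4):
    """Scale salient input channels by activation magnitude before rounding
    (simplified: s = |act|^alpha, normalized to mean 1)."""
    s = act_magnitude**alpha
    s = s / s.mean()
    # quantize the scaled weights, then fold the scale back out
    return rtn_quantize(W * s, bits) / s
```

`alpha` trades off protecting salient channels against inflating the per-row scale; the real AWQ searches for it per layer using calibration data.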
BitsAndBytes Quantization
Hugging Face integration:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True         # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
```
llama.cpp and GGUF
Format optimized for CPU inference:
```bash
# Convert the HF checkpoint to GGUF
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantize
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m
```
GGUF Quantization Levels
| Format | Bits | Size (7B) | Quality |
|---|---|---|---|
| Q2_K | 2.5 | 2.7GB | Low |
| Q3_K_M | 3.4 | 3.3GB | Medium-Low |
| Q4_K_M | 4.5 | 4.1GB | Medium |
| Q5_K_M | 5.5 | 4.8GB | Good |
| Q6_K | 6.5 | 5.5GB | Very Good |
| Q8_0 | 8 | 7.2GB | Best |
Using GGUF with Python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35  # GPU offloading
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)
```
Benchmark Comparison
Performance Metrics
Model: Llama-2-7B, Hardware: RTX 4090

| Method | Memory | Tokens/s | Perplexity |
|---|---|---|---|
| FP16 | 14GB | 45 | 5.47 |
| INT8 | 7GB | 82 | 5.49 |
| GPTQ-4 | 4GB | 125 | 5.63 |
| AWQ-4 | 4GB | 130 | 5.58 |
| GGUF Q4 | 4GB | 95 (CPU) | 5.65 |
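Perplexity is the exponential of the mean per-token negative log-likelihood, which is why small absolute differences (5.47 vs 5.63) still matter. A minimal helper showing the computation, with made-up per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# hypothetical per-token log-probabilities from a model
logprobs = [-1.2, -0.4, -2.1, -0.8]
print(round(perplexity(logprobs), 2))  # ≈ 3.08
```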
Inference Optimization
Fast Inference with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2  # split across 2 GPUs
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)
```
Flash Attention Integration
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
```
Selection Criteria
Quantization Selection Matrix
```
Use Case                        → Recommended Method

Production API (GPU available)  → GPTQ or AWQ (4-bit)
Edge / mobile                   → GGUF Q4_K_M
Fine-tuning required            → QLoRA (4-bit BitsAndBytes)
Maximum quality                 → INT8 or FP16
Maximum speed                   → AWQ + vLLM
```
Conclusion
Quantization is a critical optimization technique that makes LLMs more accessible and faster. Choosing the right method depends on the use case and hardware constraints.
At Veni AI, we provide consultancy on model optimization.
