
LLM Quantization and Model Optimization: INT8, INT4, and GPTQ

Optimizing large language models with quantization, INT8/INT4 conversion, GPTQ, AWQ techniques, and inference acceleration strategies.

Veni AI Technical Team · January 5, 2025 · 5 min read

Quantization is the process of converting model weights and activations into lower-precision numerical formats. This process significantly reduces memory usage and inference time.

Quantization Fundamentals

Why Quantization?

| Metric | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Bits/Parameter | 32 | 16 | 8 | 4 |
| 7B Model Size | 28GB | 14GB | 7GB | 3.5GB |
| Relative Speed | 1x | 1.5-2x | 2-4x | 3-5x |

Number Formats

- FP32: 1-bit sign + 8-bit exponent + 23-bit mantissa
- FP16: 1-bit sign + 5-bit exponent + 10-bit mantissa
- BF16: 1-bit sign + 8-bit exponent + 7-bit mantissa
- INT8: 8-bit integer (-128 to 127)
- INT4: 4-bit integer (-8 to 7)
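These bit widths translate directly into model size; a quick back-of-the-envelope check reproduces the 7B figures above:

```python
def model_size_gb(n_params, bits):
    # bits per parameter → bytes → gigabytes
    return n_params * bits / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {model_size_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

In practice, group scales and zero points add a small metadata overhead, which is why a real 4-bit quantized 7B checkpoint is typically slightly larger than the theoretical 3.5GB.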

Quantization Types

Post-Training Quantization (PTQ)

Quantization after training:

```python
import torch

def quantize_tensor(tensor, bits=8):
    # Min-max (asymmetric) scaling
    min_val = tensor.min()
    max_val = tensor.max()

    # Calculate scale and zero point
    scale = (max_val - min_val) / (2**bits - 1)
    zero_point = torch.round(-min_val / scale)

    # Quantize and clamp to the representable range
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, 2**bits - 1)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    return (q_tensor.float() - zero_point) * scale
```
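As a standalone sanity check, the same min-max logic can be exercised in plain Python (re-implemented here so the snippet runs on its own; the input values are illustrative):

```python
# Plain-Python mirror of the min-max quantize/dequantize logic above
def quantize(values, bits=8):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**bits - 1)
    zero_point = round(-lo / scale)
    q = [min(2**bits - 1, max(0, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

values = [-1.5, -0.3, 0.0, 0.7, 2.1]
q, scale, zp = quantize(values)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(values, restored))
# For these inputs the round-trip error stays below half a quantization step
```

Rerunning with `bits=4` grows the step size (and thus the error bound) by a factor of 255/15 = 17, which is why aggressive low-bit PTQ needs the smarter schemes described below.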

Quantization-Aware Training (QAT)

Quantization simulation during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(tensor, bits):
    # Symmetric quantize-dequantize in the forward pass
    scale = tensor.abs().max() / (2**(bits - 1) - 1)
    q = torch.round(tensor / scale)
    q = torch.clamp(q, -2**(bits - 1), 2**(bits - 1) - 1)
    # Straight-through estimator: gradients bypass round/clamp
    return tensor + (q * scale - tensor).detach()

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)
```
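One subtlety: `torch.round` has zero gradient almost everywhere, so returning the quantized values directly would stop gradients at the rounding step. The straight-through estimator is commonly implemented by detaching the quantization error so the backward pass sees the identity. A minimal self-contained check (the helper name `fake_quantize_ste` is ours):

```python
import torch

def fake_quantize_ste(w, bits=8):
    # Forward: symmetric quantize-dequantize; backward: identity (STE)
    scale = w.abs().max() / (2**(bits - 1) - 1)
    q = torch.clamp(torch.round(w / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return w + (q * scale - w).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize_ste(w).sum().backward()
# Every weight receives gradient 1.0, as if round/clamp were not there
assert torch.all(w.grad == 1.0)
```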

GPTQ (Accurate Post-Training Quantization)

Layer-wise quantization with optimal reconstruction:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,           # INT4
    group_size=128,   # Group quantization
    desc_act=False,   # Disable activation-order quantization
    damp_percent=0.1  # Hessian dampening factor
)

# Load model with quantization config
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config
)

# Quantize with calibration data (a list of tokenized examples, prepared elsewhere)
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")
```

GPTQ Working Principle

1. For each layer:
   a. Calculate the Hessian matrix (determines weight importance)
   b. Quantize the least important weights first
   c. Update the remaining weights (error compensation)
   d. Move to the next column
2. Group quantization:
   - One scale factor per group of 128 weights
   - Better accuracy, slightly more memory
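The effect of group quantization can be seen with a simple round-to-nearest quantizer (an illustration only, not full GPTQ — there is no Hessian-based error compensation here):

```python
import torch

def quantize_rtn(w, bits=4, group_size=128):
    # One symmetric scale per group of `group_size` consecutive weights
    g = w.reshape(-1, group_size)
    scale = g.abs().max(dim=1, keepdim=True).values / (2**(bits - 1) - 1)
    q = torch.clamp(torch.round(g / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return (q * scale).reshape(w.shape)

torch.manual_seed(0)
# Weights whose magnitude varies block to block, as in real layers
w = torch.randn(1024) * torch.repeat_interleave(torch.rand(8) * 2 + 0.1, 128)
err_grouped = (w - quantize_rtn(w, group_size=128)).abs().mean()
err_single = (w - quantize_rtn(w, group_size=1024)).abs().mean()
assert err_grouped < err_single  # finer groups track the local scale better
```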

AWQ (Activation-aware Weight Quantization)

Preserving important weights based on activation distribution:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")
```

BitsAndBytes Quantization

Hugging Face integration:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True          # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
```

llama.cpp and GGUF

Format optimized for CPU inference:

```bash
# Model conversion
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantization
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m
```

GGUF Quantization Levels

| Format | Bits | Size (7B) | Quality |
|---|---|---|---|
| Q2_K | 2.5 | 2.7GB | Low |
| Q3_K_M | 3.4 | 3.3GB | Medium-Low |
| Q4_K_M | 4.5 | 4.1GB | Medium |
| Q5_K_M | 5.5 | 4.8GB | Good |
| Q6_K | 6.5 | 5.5GB | Very Good |
| Q8_0 | 8 | 7.2GB | Best |

Using GGUF with Python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35  # GPU offloading
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)
```

Benchmark Comparison

Performance Metrics

```
Model: Llama-2-7B
Hardware: RTX 4090

| Method  | Memory | Tokens/s | Perplexity |
|---------|--------|----------|------------|
| FP16    | 14GB   | 45       | 5.47       |
| INT8    | 7GB    | 82       | 5.49       |
| GPTQ-4  | 4GB    | 125      | 5.63       |
| AWQ-4   | 4GB    | 130      | 5.58       |
| GGUF Q4 | 4GB    | 95 (CPU) | 5.65       |
```

Inference Optimization

Fast Inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)
```

Flash Attention Integration

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
```

Selection Criteria

Quantization Selection Matrix

```
Use Case → Recommended Method

Production API (GPU available):
  → GPTQ or AWQ (4-bit)

Edge/Mobile:
  → GGUF Q4_K_M

Fine-tuning required:
  → QLoRA (4-bit BitsAndBytes)

Maximum quality:
  → INT8 or FP16

Maximum speed:
  → AWQ + vLLM
```

Conclusion

Quantization is a critical optimization technique that makes LLMs more accessible and faster. Choosing the right method depends on the use case and hardware constraints.

At Veni AI, we provide consultancy on model optimization.
