LLM Quantization and Model Optimization: INT8, INT4, and GPTQ
Quantization is the process of converting model weights (and optionally activations) to lower-precision numerical formats, significantly reducing memory usage and inference latency.
Quantization Fundamentals
Why Quantization?
| Metric | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Bits/Parameter | 32 | 16 | 8 | 4 |
| 7B Model Size | 28GB | 14GB | 7GB | 3.5GB |
| Relative Speed | 1x | 1.5-2x | 2-4x | 3-5x |
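The model-size column follows directly from parameter count times bits per parameter; a quick sanity check in Python:

```python
def model_size_gb(n_params: float, bits: int) -> float:
    """Raw weight storage in GB (using 1 GB = 1e9 bytes, matching the table)."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {model_size_gb(7e9, bits):.1f}GB")
# → 28.0GB, 14.0GB, 7.0GB, 3.5GB
```

Real checkpoints are slightly larger because of embeddings kept at higher precision, scale factors, and metadata.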
Number Formats
```
FP32: 1-bit sign + 8-bit exponent + 23-bit mantissa
FP16: 1-bit sign + 5-bit exponent + 10-bit mantissa
BF16: 1-bit sign + 8-bit exponent + 7-bit mantissa
INT8: 8-bit integer (-128 to 127)
INT4: 4-bit integer (-8 to 7)
```
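The exponent/mantissa split determines the largest representable value. A pure-Python check, assuming the standard IEEE-754-style layout that all three float formats above follow:

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style float:
    (2 - 2^-mantissa) * 2^(top exponent - bias)."""
    bias = 2**(exp_bits - 1) - 1
    max_exp = (2**exp_bits - 2) - bias  # the top exponent code is reserved for inf/NaN
    return (2 - 2**-mantissa_bits) * 2.0**max_exp

print(max_finite(5, 10))   # FP16 → 65504.0
print(max_finite(8, 7))    # BF16 ≈ 3.39e38
print(max_finite(8, 23))   # FP32 ≈ 3.40e38
```

This is why BF16 is popular for training: it keeps FP32's dynamic range (same 8-bit exponent) while halving storage, at the cost of mantissa precision.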
Quantization Types
Post-Training Quantization (PTQ)
Quantization after training:
```python
import torch

def quantize_tensor(tensor, bits=8):
    # Min-max (asymmetric) quantization to unsigned integers
    min_val = tensor.min()
    max_val = tensor.max()

    # Calculate scale and zero point
    scale = (max_val - min_val) / (2**bits - 1)
    zero_point = int(torch.round(-min_val / scale))

    # Quantize
    q_tensor = torch.round(tensor / scale + zero_point)
    q_tensor = torch.clamp(q_tensor, 0, 2**bits - 1)

    return q_tensor.to(torch.uint8), scale, zero_point

def dequantize_tensor(q_tensor, scale, zero_point):
    return (q_tensor.float() - zero_point) * scale
```
Quantization-Aware Training (QAT)
Quantization simulation during training:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

    def forward(self, x):
        # Fake quantization during training
        q_weight = fake_quantize(self.weight, self.bits)
        return F.linear(x, q_weight)

def fake_quantize(tensor, bits):
    scale = tensor.abs().max() / (2**(bits - 1) - 1)
    q = torch.round(tensor / scale)
    q = torch.clamp(q, -2**(bits - 1), 2**(bits - 1) - 1)
    # Straight-through estimator: forward uses the quantized value,
    # backward treats quantization as the identity function
    return tensor + (q * scale - tensor).detach()
```
GPTQ (Accurate Post-Training Quantization)
Layer-wise quantization with optimal reconstruction:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,            # INT4
    group_size=128,    # Group quantization
    desc_act=False,    # Disable activation-order quantization
    damp_percent=0.1   # Hessian dampening factor
)

# Load the model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config
)

# Quantize with calibration data
model.quantize(calibration_data)

# Save
model.save_quantized("llama-2-7b-gptq")
```
GPTQ Working Principle
1. For each layer:
   - Compute the Hessian matrix (determines weight importance)
   - Quantize the least important weights first
   - Update the remaining weights to compensate for the quantization error
   - Move to the next column
2. Group quantization:
   - One scale factor per group of 128 weights
   - Better accuracy, slightly more memory
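The error-compensation step can be sketched in a few lines. This is a deliberately simplified illustration, not GPTQ itself: it quantizes one weight at a time and spreads the rounding error uniformly over the not-yet-quantized weights, where real GPTQ weights the update by the inverse Hessian.

```python
import numpy as np

def quantize_with_compensation(w, bits=4):
    """Greedy one-at-a-time quantization with error feedback
    (uniform error spread standing in for GPTQ's H^-1-weighted update)."""
    w = w.astype(np.float64).copy()
    qmax = 2**(bits - 1) - 1
    scale = np.abs(w).max() / qmax
    dq = np.empty_like(w)
    for j in range(len(w)):
        q = np.clip(round(w[j] / scale), -qmax - 1, qmax)
        dq[j] = q * scale
        err = w[j] - dq[j]
        remaining = len(w) - j - 1
        if remaining:
            # fold this weight's rounding error into the later weights
            w[j + 1:] += err / remaining
    return dq, scale
```

Even this crude version shows the key idea: later weights absorb earlier rounding errors, so the layer's output error stays smaller than with independent round-to-nearest.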
AWQ (Activation-aware Weight Quantization)
Preserving important weights based on activation distribution:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(
    tokenizer=tokenizer,
    quant_config=quant_config,
    calib_data=calibration_samples
)

model.save_quantized("llama-2-7b-awq")
```
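The intuition behind AWQ can be shown standalone: input channels with large activations are scaled up before rounding so they lose less relative precision, and the inverse scale is folded back out (in a real deployment it is absorbed into the preceding layer). A simplified numpy sketch, not the library's actual algorithm:

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Plain round-to-nearest symmetric quantization, one scale per output row."""
    qmax = 2**(bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_style_quantize(W, act_magnitude, alpha=0.5, bits=4):
    """Scale salient input channels by activation magnitude before rounding
    (simplified: s = |act|^alpha, normalized to mean 1)."""
    s = act_magnitude**alpha
    s = s / s.mean()
    # quantize the scaled weights, then fold the scale back out
    return rtn_quantize(W * s, bits) / s
```

`alpha` trades off protecting salient channels against inflating the per-row scale; the real AWQ searches for it per layer using calibration data.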
BitsAndBytes Quantization
Hugging Face integration:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

# 4-bit quantization (NF4)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True         # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto"
)
```
llama.cpp and GGUF
Format optimized for CPU inference:
```bash
# Convert the HF checkpoint to GGUF
python convert.py llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

# Quantize
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m
```
GGUF Quantization Levels
| Format | Bits | Size (7B) | Quality |
|---|---|---|---|
| Q2_K | 2.5 | 2.7GB | Low |
| Q3_K_M | 3.4 | 3.3GB | Medium-Low |
| Q4_K_M | 4.5 | 4.1GB | Medium |
| Q5_K_M | 5.5 | 4.8GB | Good |
| Q6_K | 6.5 | 5.5GB | Very Good |
| Q8_0 | 8 | 7.2GB | Best |
Using GGUF with Python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35  # GPU offloading
)

output = llm(
    "What is artificial intelligence?",
    max_tokens=256,
    temperature=0.7
)
```
Benchmark Comparison
Performance Metrics
Model: Llama-2-7B, Hardware: RTX 4090

| Method | Memory | Tokens/s | Perplexity |
|---|---|---|---|
| FP16 | 14GB | 45 | 5.47 |
| INT8 | 7GB | 82 | 5.49 |
| GPTQ-4 | 4GB | 125 | 5.63 |
| AWQ-4 | 4GB | 130 | 5.58 |
| GGUF Q4 | 4GB | 95 (CPU) | 5.65 |
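Perplexity is the exponential of the mean per-token negative log-likelihood, which is why small absolute differences (5.47 vs 5.63) still matter. A minimal helper showing the computation, with made-up per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# hypothetical per-token log-probabilities from a model
logprobs = [-1.2, -0.4, -2.1, -0.8]
print(round(perplexity(logprobs), 2))  # ≈ 3.08
```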
Inference Optimization
Fast Inference with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2  # split across 2 GPUs
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=256
)

outputs = llm.generate(["Hello, "], sampling_params)
```
Flash Attention Integration
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
```
Selection Criteria
Quantization Selection Matrix
```
Use Case                        → Recommended Method

Production API (GPU available)  → GPTQ or AWQ (4-bit)
Edge / mobile                   → GGUF Q4_K_M
Fine-tuning required            → QLoRA (4-bit BitsAndBytes)
Maximum quality                 → INT8 or FP16
Maximum speed                   → AWQ + vLLM
```
Conclusion
Quantization is a critical optimization technique that makes LLMs more accessible and faster. Choosing the right method depends on the use case and hardware constraints.
At Veni AI, we provide consultancy on model optimization.
