Fine-Tuning and Transfer Learning: Model Training Guide
Fine-tuning is the process of adapting a pre-trained model to a specific task or domain. With the right fine-tuning strategy, enterprise AI solutions can see substantial task-specific performance gains at a fraction of the cost of training from scratch.
Transfer Learning Fundamentals
Transfer learning is the reuse of knowledge learned on one task to improve performance on another, related task.
Advantages of Transfer Learning
- Data Efficiency: Good results with less data
- Time Saving: Much faster than training from scratch
- Cost Reduction: Less compute resources
- Performance: Leveraging pre-trained knowledge
Pre-training vs Fine-tuning
Pre-training:
- Large, general dataset (TBs)
- Learning general language/task understanding
- Training takes months
- Cost in millions of dollars

Fine-tuning:
- Small, domain-specific dataset (MB-GB)
- Specific task adaptation
- Training takes hours-days
- Cost in thousands of dollars
Full Fine-Tuning
Updating all model parameters.
Advantages
- Maximum adaptation capacity
- Highest potential performance
Disadvantages
- High memory requirement
- Risk of catastrophic forgetting
- Separate model copy for each task
Hardware Requirements
Approximate memory for the model weights alone; optimizer states, gradients, and activations add several times more during full fine-tuning:
| Model Size | Weights (FP32) | Weights (FP16) |
|---|---|---|
| 7B | 28 GB | 14 GB |
| 13B | 52 GB | 26 GB |
| 70B | 280 GB | 140 GB |
Parameter-Efficient Fine-Tuning (PEFT)
Fine-tuning by updating only a small portion of parameters.
PEFT Advantages
- Memory Efficiency: 90%+ reduction
- Speed: Faster training
- Modularity: Single base model, multiple adapters
- Catastrophic Forgetting: Reduced risk, since the base weights stay frozen
LoRA (Low-Rank Adaptation)
The most popular PEFT method.
LoRA Theory
Updating the weight matrix approximately with low-rank matrices:
```
W' = W + ΔW = W + BA

Where:
- W: Original weight matrix (d × k)
- B: Low-rank matrix (d × r)
- A: Low-rank matrix (r × k)
- r: Rank (typical: 8-64)
```
Parameter Savings
```
Original: d × k parameters
LoRA: r × (d + k) parameters

Example (d=4096, k=4096, r=16):
Original: 16.8M parameters
LoRA: 131K parameters
Savings: ~128x
```
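The arithmetic above can be checked with a few lines of Python (a standalone sketch; `lora_params` is a hypothetical helper of ours, not a peft API):

```python
# Illustrative calculation of LoRA parameter savings for one weight matrix.
def lora_params(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (original, lora, savings_factor)."""
    original = d * k       # full weight matrix W (d x k)
    lora = r * (d + k)     # low-rank factors B (d x r) and A (r x k)
    return original, lora, original / lora

original, lora, factor = lora_params(d=4096, k=4096, r=16)
print(f"Original: {original:,}  LoRA: {lora:,}  Savings: ~{factor:.0f}x")
# Original: 16,777,216  LoRA: 131,072  Savings: ~128x
```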
LoRA Configuration
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                 # Rank
    lora_alpha=32,        # Scaling factor
    target_modules=[      # Which layers to apply LoRA to
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, config)
```
LoRA Hyperparameters
Rank (r):
- Low (4-8): Simple tasks, little data
- Medium (16-32): General use
- High (64-128): Complex adaptation
Alpha:
- Generally alpha = 2 × r
Target Modules:
- Attention layers: q_proj, k_proj, v_proj, o_proj
- MLP layers: gate_proj, up_proj, down_proj
QLoRA (Quantized LoRA)
Combination of LoRA + 4-bit quantization.
QLoRA Features
- 4-bit NormalFloat (NF4): Custom quantization format
- Double Quantization: Quantizing quantization constants
- Paged Optimizers: GPU memory overflow management
QLoRA Memory Comparison
| Method | 7B Model | 70B Model |
|---|---|---|
| Full FT (FP32) | 28 GB | 280 GB |
| Full FT (FP16) | 14 GB | 140 GB |
| LoRA (FP16) | 12 GB | 120 GB |
| QLoRA (4-bit) | 6 GB | 48 GB |
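The table figures include runtime overhead (adapters, activations, paging); the raw weight footprint per precision can be estimated with a quick sketch (names and constants are ours, chosen to match the FP32/FP16 rows above):

```python
# Back-of-the-envelope weight memory per precision (weights only;
# optimizer states, activations, and adapter overhead come on top).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    bytes_total = n_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return bytes_total / 1e9  # decimal GB, as in the table above

print(weight_memory_gb(7, "fp32"))  # 28.0
print(weight_memory_gb(70, "nf4"))  # 35.0
```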
QLoRA Implementation
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```
Other PEFT Methods
Prefix Tuning
Adds learnable prefixes to input embeddings:
Input: [PREFIX_1, PREFIX_2, ..., PREFIX_N, token_1, token_2, ...]
Prompt Tuning
Learning soft prompts:
[SOFT_PROMPT] + "Actual input text"
Adapter Layers
Adding small networks between transformer layers:
Attention → Adapter → LayerNorm → FFN → Adapter → LayerNorm
(IA)³ - Infused Adapter by Inhibiting and Amplifying Inner Activations
Multiplying activations with learned vectors:
output = activation × learned_vector
Data Preparation
Data Formats
Instruction Format:
```json
{
  "instruction": "Summarize this text",
  "input": "Long text...",
  "output": "Summary..."
}
```
Chat Format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Question..."},
    {"role": "assistant", "content": "Answer..."}
  ]
}
```
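Converting between the two formats is a common preprocessing step. A minimal sketch (field names follow the examples above; `instruction_to_chat` is our own illustrative helper):

```python
# Convert one instruction-format record into chat format.
def instruction_to_chat(record: dict,
                        system_prompt: str = "You are a helpful assistant") -> dict:
    user_content = record["instruction"]
    if record.get("input"):
        # Append the optional input below the instruction.
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

chat = instruction_to_chat({
    "instruction": "Summarize this text",
    "input": "Long text...",
    "output": "Summary...",
})
```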
Data Quality
Good Data Characteristics:
- Diversity (diverse examples)
- Consistency (consistent format)
- Accuracy (accurate labels)
- Sufficient quantity (usually 1K-100K examples)
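A simple pre-flight check along these lines helps catch format problems before training (an illustrative helper; adjust the required fields to your own format):

```python
# Flag records with missing or empty required fields.
def validate_records(records: list[dict],
                     required: tuple = ("instruction", "output")) -> list[int]:
    """Return indices of records that fail the check."""
    bad = []
    for i, rec in enumerate(records):
        if any(not str(rec.get(field, "")).strip() for field in required):
            bad.append(i)
    return bad

data = [
    {"instruction": "Summarize", "output": "OK"},
    {"instruction": "", "output": "missing instruction"},
]
print(validate_records(data))  # [1]
```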
Data Augmentation
```python
# Illustrative augmentation steps; paraphrase, translate, and
# replace_synonyms are placeholders for whatever tooling you use.

# Paraphrasing
augmented_data = paraphrase(original_data)

# Back-translation
translated = translate(text, "tr")
back_translated = translate(translated, "en")

# Synonym replacement
augmented = replace_synonyms(text)
```
Training Strategies
Hyperparameter Selection
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate=2e-4,              # Typical for LoRA
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size: 4 x 4 = 16
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch"
)
```
Learning Rate
- Full fine-tuning: 1e-5 - 5e-5
- LoRA: 1e-4 - 3e-4
- QLoRA: 2e-4 - 5e-4
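The warmup + cosine schedule configured above can be sketched in a few lines (our own standalone implementation for illustration, not the transformers scheduler):

```python
import math

# Linear warmup to the peak LR, then cosine decay toward zero.
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.03) -> float:
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```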
Regularization
```python
# Weight decay
weight_decay=0.01

# Dropout
lora_dropout=0.05

# Gradient clipping
max_grad_norm=1.0
```
Evaluation and Validation
Metrics
Perplexity:
```
PPL = exp(average cross-entropy loss)
```
Lower is better.
BLEU/ROUGE: Text generation quality
Task-specific: Accuracy, F1, custom metrics
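The perplexity formula above translates directly into code (stdlib only):

```python
import math

# Perplexity from per-token cross-entropy losses: PPL = exp(mean loss).
def perplexity(token_losses: list[float]) -> float:
    return math.exp(sum(token_losses) / len(token_losses))

print(perplexity([2.0, 2.0, 2.0]))  # e^2, about 7.39
```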
Detecting Overfitting
```
Train loss ↓ + Validation loss ↑ = Overfitting
```
Solutions:
- Early stopping
- More dropout
- Data augmentation
- Fewer epochs
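The early-stopping idea can be sketched as follows (an illustrative class of our own, not the transformers `EarlyStoppingCallback`):

```python
# Stop when validation loss has not improved for `patience` evaluations.
class EarlyStopping:
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1       # no improvement this evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
for loss in [1.0, 0.8, 0.9, 0.95]:
    if stopper.should_stop(loss):
        print("stopping")  # triggers on the second non-improving eval
        break
```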
Deployment
Model Merging
Merging LoRA adapter into base model:
```python
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
```
Multi-Adapter Serving
Multiple adapters with a single base model:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("base")
model_a = PeftModel.from_pretrained(base_model, "adapter_a")
model_b = PeftModel.from_pretrained(base_model, "adapter_b")
```
Enterprise Fine-Tuning Pipeline
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │────▶│  Training   │────▶│ Evaluation  │
│ Preparation │     │ (LoRA/QLoRA)│     │  & Testing  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐                         ┌──────▼──────┐
│ Production  │◀────────────────────────│    Model    │
│ Deployment  │                         │  Registry   │
└─────────────┘                         └─────────────┘
```
Common Issues and Solutions
1. Out of Memory
Solution: QLoRA, gradient checkpointing, reducing batch size
2. Catastrophic Forgetting
Solution: Lower learning rate, replay buffer, elastic weight consolidation
3. Overfitting
Solution: More data, regularization, early stopping
4. Poor Generalization
Solution: Increasing data diversity, instruction diversity
Conclusion
Fine-tuning is the most effective way to adapt pre-trained models to enterprise needs. With PEFT methods such as LoRA and QLoRA, substantial customization is possible even on limited hardware.
At Veni AI, we provide consultancy and implementation services for enterprise fine-tuning projects. Contact us for your needs.
