Field	Value	Source
Canonical Path	/blog/fine-tuning-transfer-learning-model-egitimi-rehberi	Veni AI Blog
Primary Category	Model Training	Post Metadata
Author	Veni AI Technical Team	Post Metadata

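Fine-Tuning and Transfer Learning: A Model Training Guide

Fine-tuning is the process of customizing a pre-trained model for a specific task or domain. With the right fine-tuning strategy, enterprise AI solutions can achieve performance improvements of up to 40%.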

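Transfer Learning Fundamentals

Transfer learning applies the knowledge a model learned on one task to a different task.

Advantages of Transfer Learning

Data efficiency: Good results with less data
Time savings: Much faster than training from scratch
Lower cost: Requires fewer compute resources
Better performance: Builds on pre-trained knowledge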

Pre-training vs Fine-tuning

Pre-training:
- Large, general dataset (TBs)
- Learning general language/task understanding
- Training takes months
- Cost in millions of dollars

Fine-tuning:
- Small, domain-specific dataset (MB-GB)
- Specific task adaptation
- Training takes hours-days
- Cost in thousands of dollars

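Full Fine-Tuning

Updates all of the model's parameters.

Advantages

Maximum adaptation capacity
Highest potential performance

Disadvantages

High GPU memory requirements
Risk of catastrophic forgetting
A separate copy of the model is needed for each task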

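Hardware Requirements

Model Size	GPU Memory (FP32)	GPU Memory (FP16)
7B	28 GB	14 GB
13B	52 GB	26 GB
70B	280 GB	140 GB

These figures cover the model weights alone (4 bytes per parameter in FP32, 2 bytes in FP16). Gradients, optimizer states, and activations add to the total during training.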
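
The table values follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only; the helper name `weight_memory_gb` is illustrative and not part of any library:

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[dtype]

for size in (7, 13, 70):
    fp32 = weight_memory_gb(size, "fp32")
    fp16 = weight_memory_gb(size, "fp16")
    print(f"{size}B  FP32: {fp32:.0f} GB  FP16: {fp16:.0f} GB")
```

Treat the result as a lower bound when budgeting hardware, since it leaves out everything a training run allocates beyond the weights.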

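Parameter-Efficient Fine-Tuning (PEFT)

Fine-tuning methods that update only a small subset of parameters.

PEFT Advantages

Memory efficiency: 90%+ reduction in GPU memory
Speed: Faster training
Modularity: A single base model can be adapted to multiple tasks
Catastrophic forgetting: Lower risk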

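LoRA (Low-Rank Adaptation)

The most popular PEFT method.

LoRA Theory

Approximates the weight update with a product of two low-rank matrices:

W' = W + ΔW = W + BA

Where:
- W: Original weight matrix (d × k)
- B: Low-rank matrix (d × r)
- A: Low-rank matrix (r × k)
- r: Rank (typical: 8-64)

During training W stays frozen and only B and A are updated.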
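
To make the shapes concrete, here is a toy sketch of W' = W + BA in plain Python. The sizes (d = 3, k = 3, r = 1) and all values are made up for illustration, and the `matmul` and `add` helpers are hypothetical stand-ins for a tensor library:

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen pre-trained weights, shape d x k
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]      # trainable, shape d x r
A = [[0.5, 0.0, -1.0]]         # trainable, shape r x k

delta_W = matmul(B, A)         # shape d x k, rank at most r
W_adapted = add(W, delta_W)    # W' = W + BA

# B and A together hold 6 values, while W holds 9
trainable = len(B) * len(B[0]) + len(A) * len(A[0])
print(W_adapted, trainable)
```

At realistic sizes the gap between the two parameter counts is far larger than in this toy case.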

Parameter Savings

Original: d × k parameters
LoRA: r × (d + k) parameters

Example (d=4096, k=4096, r=16):
Original: 16.8M parameters
LoRA: 131K parameters
Savings: 128x
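
The same arithmetic can be checked in a few lines. This is only a sketch of the two formulas for a single weight matrix; the function name is illustrative. Worked through exactly, the ratio for this example comes out at 128:

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) parameter counts for one d x k weight matrix."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full / lora)  # 16777216 131072 128.0
```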

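LoRA Configuration

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model to adapt (the checkpoint name is an example)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                      # Rank
    lora_alpha=32,             # Scaling factor
    target_modules=[           # Which layers to apply
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # Report trainable vs total parameters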

LoRA Hyperparameters

Rank (r):

Low (4–8): Simple tasks, small datasets
Medium (16–32): General-purpose choice
High (64–128): Complex tasks

Alpha:

Typically alpha = 2 × r

Target Modules:

Attention layers: q_proj, k_proj, v_proj, o_proj
MLP layers: gate_proj, up_proj, down_proj
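
One way to read the alpha rule: in the standard LoRA formulation, which the peft library follows by default as far as we know, the update BA is multiplied by alpha / r before it is added to W. A small sketch under that assumption (the helper name is illustrative):

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update BA, assuming scaling = alpha / r."""
    return lora_alpha / r

# With alpha = 2 * r the factor stays at 2.0 whatever rank is chosen
for r in (8, 16, 64):
    print(r, lora_scaling(2 * r, r))
```

Keeping alpha tied to r this way means that changing the rank does not also change the effective magnitude of the update.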

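QLoRA (Quantized LoRA)

Combines LoRA with 4-bit quantization.

QLoRA Features

4-bit NormalFloat (NF4): A custom quantization format
Double quantization: Quantizes the quantization constants themselves
Paged optimizers: Handle GPU memory spikes by paging optimizer states to CPU memory

QLoRA Memory Comparison

Method	7B Model	70B Model
Full FT (FP32)	28 GB	280 GB
Full FT (FP16)	14 GB	140 GB
LoRA (FP16)	12 GB	120 GB
QLoRA (4-bit)	6 GB	48 GB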

QLoRA Implementation

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # Load weights in 4-bit precision
    bnb_4bit_use_double_quant=True,          # Also quantize the quantization constants
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16    # Dtype used for matrix multiplications
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Other PEFT Methods

Prefix Tuning

Prepends learnable prefix vectors to the input embeddings:

Input: [PREFIX_1, PREFIX_2, ..., PREFIX_N, token_1, token_2, ...]

Prompt Tuning

Learns soft prompts:

[SOFT_PROMPT] + "Actual input text"

Adapter Layers

Adds small networks between transformer layers:

Attention → Adapter → LayerNorm → FFN → Adapter → LayerNorm

(IA)³ - Infused Adapter

Multiplies activations by learned vectors:

output = activation × learned_vector

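
The (IA)³ rescaling is simple enough to sketch directly. The function below is a plain-Python illustration of the element-wise product, not the peft implementation; the name `ia3_scale` and the sample values are made up:

```python
def ia3_scale(activation: list[float], learned_vector: list[float]) -> list[float]:
    """Element-wise rescaling: output_i = activation_i * learned_vector_i."""
    if len(activation) != len(learned_vector):
        raise ValueError("activation and learned_vector must have the same length")
    return [a * l for a, l in zip(activation, learned_vector)]

activation = [0.5, -1.0, 2.0]
learned_vector = [1.0, 1.0, 1.0]   # a vector of ones leaves the activation unchanged
print(ia3_scale(activation, learned_vector))
```

Starting the learned vector at ones means the adapted model initially behaves exactly like the base model, and training only has to learn how far to move away from that.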

Data Preparation

Data Formats

Instruction format:

{
  "instruction": "Summarize this text",
  "input": "Long text...",
  "output": "Summary..."
}

Chat format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Question..."},
    {"role": "assistant", "content": "Answer..."}
  ]
}
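
Datasets often arrive in one format while the training code expects the other. A minimal sketch of converting an instruction record into the chat format, assuming the field names shown here; the function name and the way instruction and input are joined are illustrative choices:

```python
import json

def instruction_to_chat(record: dict, system_prompt: str = "You are a helpful assistant") -> dict:
    """Convert an instruction-format record into a chat-format record."""
    user_content = record["instruction"]
    # Append the optional input text below the instruction
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = {"instruction": "Summarize this text", "input": "Long text...", "output": "Summary..."}
chat = instruction_to_chat(example)
print(json.dumps(chat, indent=2))
```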

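Data Quality

Characteristics of good data:

Diversity (varied examples)
Consistency (uniform format)
Accuracy (correct labels)
Sufficient volume (typically 1K-100K examples)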

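Data Augmentation

# paraphrase, translate, and replace_synonyms are placeholders for your own helpers

# Paraphrasing
augmented_data = paraphrase(original_data)

# Back-translation: translate to a pivot language (here Turkish) and back
translated = translate(text, "tr")
back_translated = translate(translated, "en")

# Synonym replacement
augmented = replace_synonyms(text)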
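
As one possible shape for the synonym step, here is a small self-contained sketch. The synonym table is a hypothetical toy; a real project would draw on a thesaurus or a model, and would likely replace only a random subset of words:

```python
import re

# Hypothetical toy synonym table, for illustration only
SYNONYMS = {"quick": "fast", "error": "fault", "begin": "start"}

def replace_synonyms(text: str, synonyms: dict[str, str] = SYNONYMS) -> str:
    """Replace each whole word found in the table, leaving other text untouched."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        # Keep a leading capital so sentence starts still look right
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)

print(replace_synonyms("Begin with a quick check for any error."))
```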

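Training Strategies

Hyperparameter Selection

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",    # Where checkpoints are written (example path)
    learning_rate=2e-4,        # Typical for LoRA
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch"  # Named eval_strategy in recent transformers releases
)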
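
Several of these settings interact. The sketch below works out the effective batch size, an approximate optimizer step count, and the warmup steps they imply. The dataset size of 10,000 examples and the single GPU are assumptions for illustration, and the step count is an approximation rather than the trainer's exact bookkeeping:

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices

def training_steps(num_examples: int, epochs: int, eff_batch: int) -> int:
    """Approximate number of optimizer steps over the whole run."""
    steps_per_epoch = math.ceil(num_examples / eff_batch)
    return steps_per_epoch * epochs

eff = effective_batch_size(per_device=4, grad_accum=4)               # single GPU assumed
total = training_steps(num_examples=10_000, epochs=3, eff_batch=eff)  # hypothetical dataset size
warmup = math.ceil(total * 0.03)                                      # warmup_ratio=0.03
print(eff, total, warmup)  # 16 1875 57
```

Gradient accumulation is what lets a small per-device batch behave like a batch of 16 without the memory cost of holding 16 examples at once.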

Learning Rate

Full fine-tuning: 1e-5 to 5e-5
LoRA: 1e-4 to 3e-4
QLoRA: 2e-4 to 5e-4
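
These values are peak learning rates; with a cosine scheduler and warmup, the rate actually applied changes at every step. A sketch of the usual shape, assuming linear warmup from zero followed by cosine decay to zero. The step counts are illustrative, and the function is a stand-in for the library scheduler, not its implementation:

```python
import math

def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Start, end of warmup, midpoint of decay, final step
for step in (0, 57, 966, 1875):
    print(step, f"{cosine_lr(step, 1875, 57, 2e-4):.2e}")
```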

Regularization

# Weight decay (TrainingArguments)
weight_decay=0.01

# Dropout on the LoRA layers (LoraConfig)
lora_dropout=0.05

# Gradient clipping (TrainingArguments)
max_grad_norm=1.0

Evaluation and Validation

Metrics

Perplexity:

PPL = exp(average cross-entropy loss)
Lower = better

BLEU/ROUGE: Text generation quality

Task-specific metrics: Accuracy, F1, custom metrics
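
Of these metrics, perplexity is the easiest to compute by hand from a training log. A minimal sketch, assuming you already have per-token cross-entropy losses in nats; the function name and sample values are illustrative:

```python
import math

def perplexity(token_losses: list[float]) -> float:
    """PPL = exp(mean cross-entropy loss), with losses measured in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))

# A model that is, on average, as uncertain as a uniform choice among
# 4 tokens has a loss of ln(4) per token and a perplexity of about 4
losses = [math.log(4)] * 5
print(perplexity(losses))
```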

Detecting Overfitting

Train loss ↓ + Validation loss ↑ = Overfitting

Solutions:
- Early stopping
- More dropout
- Data augmentation
- Fewer epochs
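
Early stopping, the first remedy listed, can be sketched as a patience counter over the validation losses. This is a simplified illustration with made-up loss values, not the trainer's built-in callback:

```python
def early_stopping_epoch(val_losses: list[float], patience: int = 2) -> int:
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss          # new best validation loss, reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered, ran to the end

# Validation loss bottoms out at epoch 2 and then climbs
val_losses = [1.20, 0.95, 0.90, 0.93, 0.97, 1.05]
print(early_stopping_epoch(val_losses))  # 4
```

In practice you would also keep the checkpoint from the best epoch (epoch 2 here) rather than the one where training stopped.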

Deployment

Model Merging

Merge the LoRA adapter into the base model:

merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")

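Multi-Adapter Serving

Load multiple adapters on top of a single base model:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("base")
# Attach the first adapter, then load the second onto the same wrapped model
model = PeftModel.from_pretrained(base_model, "adapter_a", adapter_name="adapter_a")
model.load_adapter("adapter_b", adapter_name="adapter_b")
model.set_adapter("adapter_b")  # Switch the active adapter per request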

Enterprise Fine-Tuning Pipeline

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Data        │────▶│ Training    │────▶│ Evaluation  │
│ Preparation │     │ (LoRA/QLoRA)│     │ & Testing   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                    ┌─────────────┐     ┌──────▼──────┐
                    │ Production  │◀────│ Model       │
                    │ Deployment  │     │ Registry    │
                    └─────────────┘     └─────────────┘

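Common Problems and Solutions

1. Out of Memory

Solution: QLoRA, gradient checkpointing, smaller batch size

2. Catastrophic Forgetting

Solution: Lower learning rate, replay buffer, elastic weight consolidation

3. Overfitting

Solution: More data, regularization, early stopping

4. Poor Generalization

Solution: More diverse data, more diverse instructions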

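Conclusion

Fine-tuning is the most effective way to adapt pre-trained models to enterprise needs. Even with limited resources, PEFT methods such as LoRA and QLoRA make powerful customization possible.

At Veni AI, we provide consulting and implementation services for enterprise fine-tuning projects. Contact us to discuss your needs.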