# Evaluating AI Models: A Guide to Metrics and Benchmarks
Rigorous evaluation is key to choosing the right model. This guide walks through the metrics and benchmarks used to evaluate AI models.
## Core Metrics

### Perplexity

Measures how well a language model predicts text:

```python
import torch
import math

def calculate_perplexity(model, tokenizer, text):
    encodings = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
        loss = outputs.loss

    perplexity = math.exp(loss.item())
    return perplexity

# Lower perplexity = better model
# Typical values: 5-20 (good), >100 (poor)
```
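For example, a minimal usage sketch with Hugging Face `transformers` (GPT-2 is used here purely as an illustration; any causal LM works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ppl = calculate_perplexity(model, tokenizer,
                           "The quick brown fox jumps over the lazy dog.")
print(f"Perplexity: {ppl:.2f}")
```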
### Accuracy

The share of correct predictions:

```python
def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```
### F1 Score

A balance between precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def calculate_metrics(predictions, labels):
    return {
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted")
    }
```

## LLM Benchmarks

### MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subject areas:

```python
def evaluate_mmlu(model, dataset):
    results = {}

    for subject in dataset.subjects:
        correct = 0
        total = 0

        for question in dataset.get_questions(subject):
            prompt = format_mcq_prompt(question)
            response = model.generate(prompt)
            predicted = extract_answer(response)

            if predicted == question.correct_answer:
                correct += 1
            total += 1

        results[subject] = correct / total

    return {
        "subjects": results,
        "average": sum(results.values()) / len(results)
    }
```
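The `format_mcq_prompt` and `extract_answer` helpers are left undefined above. A minimal sketch of what they might look like (the `question.text` and `question.choices` fields are hypothetical):

```python
import re

def format_mcq_prompt(question):
    # Assumes the question object exposes .text and .choices (hypothetical fields)
    options = "\n".join(
        f"{letter}) {choice}"
        for letter, choice in zip("ABCD", question.choices)
    )
    return f"{question.text}\n{options}\nAnswer with a single letter (A-D):"

def extract_answer(response):
    # Take the first standalone A-D letter in the model's reply
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None
```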
MMLU results (2024):
| Model | Score |
|---|---|
| GPT-4 | 86.4% |
| Claude 3 Opus | 86.8% |
| Gemini Ultra | 83.7% |
| Llama 3 70B | 79.5% |
### HellaSwag

Commonsense reasoning about everyday situations:

```
Context: "A woman is outside with a bucket and a dog.
The dog is running around trying to avoid a bath. She..."

Options:
A) rinses the dog off with a hose (correct)
B) calls the dog and feeds it
C) throws the bucket at the dog
D) walks into the house
```
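HellaSwag is typically scored by likelihood rather than free-form generation: each candidate ending is appended to the context, and the model picks the one with the highest length-normalized log-probability. A rough sketch, assuming a `transformers`-style causal LM:

```python
import torch

def score_ending(model, tokenizer, context, ending):
    # Length-normalized log-probability of the ending tokens given the context
    ctx_ids = tokenizer(context, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(context + " " + ending, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        logits = model(full_ids).logits

    # Logits at position t predict token t + 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ending_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    token_lp = [log_probs[t, full_ids[0, t + 1]].item() for t in ending_positions]
    return sum(token_lp) / len(token_lp)

def predict_hellaswag(model, tokenizer, context, endings):
    scores = [score_ending(model, tokenizer, context, e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])
```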
### TruthfulQA

Evaluates hallucination and truthfulness:

```python
def evaluate_truthfulness(model, questions):
    truthful_count = 0
    informative_count = 0

    for q in questions:
        response = model.generate(q.question)

        # Human evaluation or a trained classifier
        is_truthful = check_truthfulness(response, q.ground_truth)
        is_informative = check_informativeness(response)

        if is_truthful:
            truthful_count += 1
        if is_informative:
            informative_count += 1

    return {
        "truthful": truthful_count / len(questions),
        "informative": informative_count / len(questions)
    }
```
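`check_truthfulness` and `check_informativeness` are placeholders; in practice they are backed by human raters or a judge model. One possible sketch using an LLM judge (the prompt wording and the `judge_model` object are assumptions):

```python
def check_truthfulness(response, ground_truth):
    prompt = (
        "Given the reference answer below, does the response make any false claims?\n"
        f"Reference: {ground_truth}\n"
        f"Response: {response}\n"
        "Reply with exactly one word: YES or NO."
    )
    verdict = judge_model.generate(prompt)  # judge_model is assumed to exist
    return verdict.strip().upper().startswith("NO")
```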
### HumanEval

Code generation ability:

```python
def evaluate_humaneval(model, problems):
    pass_at_1 = 0
    pass_at_10 = 0

    for problem in problems:
        solutions = [model.generate_code(problem.prompt) for _ in range(10)]

        passed = [run_tests(sol, problem.tests) for sol in solutions]

        if passed[0]:
            pass_at_1 += 1
        if any(passed):
            pass_at_10 += 1

    return {
        "pass@1": pass_at_1 / len(problems),
        "pass@10": pass_at_10 / len(problems)
    }
```
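Counting raw passes like this is a noisy estimate. The HumanEval paper (Chen et al., 2021) instead computes an unbiased pass@k estimator from n samples of which c pass:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), the probability that
    at least one of k samples drawn from n (with c passing) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples generated, 3 passed the tests
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```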
### MT-Bench

Multi-turn dialogue quality:

```python
def mt_bench_evaluate(model, conversations):
    scores = []

    for conv in conversations:
        # Multi-turn dialogue
        responses = []
        for turn in conv.turns:
            response = model.generate(turn.prompt, history=responses)
            responses.append(response)

        # GPT-4 judge scoring (1-10)
        score = gpt4_judge(conv.turns, responses)
        scores.append(score)

    return sum(scores) / len(scores)
```

## RAG Evaluation

### Retrieval Metrics

```python
def retrieval_metrics(retrieved_docs, relevant_docs, k=10):
    retrieved_k = retrieved_docs[:k]
    relevant_set = set(relevant_docs)

    # Recall@K
    retrieved_relevant = len(set(retrieved_k) & relevant_set)
    recall_k = retrieved_relevant / len(relevant_set)

    # Precision@K
    precision_k = retrieved_relevant / k

    # Reciprocal rank for a single query; MRR averages this over queries
    mrr = 0
    for i, doc in enumerate(retrieved_k):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "recall@k": recall_k,
        "precision@k": precision_k,
        "mrr": mrr
    }
```
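A quick usage example with hypothetical document IDs:

```python
retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant = ["doc1", "doc2"]

print(retrieval_metrics(retrieved, relevant, k=5))
# {'recall@k': 1.0, 'precision@k': 0.4, 'mrr': 0.333...}
```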
### RAGAS Metrics

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def evaluate_rag(questions, answers, contexts, ground_truths):
    # ragas.evaluate expects a Hugging Face Dataset
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )

    return results
```
## Text Generation Metrics

### BLEU Score

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()

    # Smoothing avoids zero scores on short sentences with missing n-gram overlap
    smoothing = SmoothingFunction().method1
    return sentence_bleu(reference_tokens, candidate_tokens,
                         smoothing_function=smoothing)
```
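Sentence-level BLEU is noisy; for scoring a whole test set, NLTK's `corpus_bleu` aggregates n-gram counts across all pairs:

```python
from nltk.translate.bleu_score import corpus_bleu

references = [["the cat sat on the mat".split()],
              ["there is a dog in the park".split()]]
candidates = ["the cat is on the mat".split(),
              "a dog is in the park".split()]

print(corpus_bleu(references, candidates))
```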
### ROUGE Score

```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    scores = scorer.score(reference, candidate)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
```
### BERTScore

Semantic similarity:

```python
from bert_score import score

def calculate_bertscore(references, candidates):
    # lang selects the underlying model; "en" for English text
    P, R, F1 = score(candidates, references, lang="en")
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
```
## LLM-as-Judge

Evaluation using GPT-4 as the judge:

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge(response, criteria):
    prompt = f"""Evaluate the following response.

Response: {response}

Evaluation criteria:
{criteria}

Rate from 1-10 and explain your reasoning.
JSON format: {{"score": X, "reasoning": "..."}}
"""

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(result.choices[0].message.content)
```
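A hypothetical call might look like this:

```python
verdict = llm_judge(
    response="Paris is the capital of France.",
    criteria="Factual accuracy, completeness, clarity"
)
print(verdict["score"], verdict["reasoning"])
```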
## A/B Testing Models

```python
class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.results = {"a_wins": 0, "b_wins": 0, "ties": 0}

    def compare(self, prompt):
        response_a = self.model_a.generate(prompt)
        response_b = self.model_b.generate(prompt)

        # Blind comparison with an LLM judge; returns "a", "b", or "tie"
        winner = self.judge_comparison(prompt, response_a, response_b)

        if winner == "tie":
            self.results["ties"] += 1
        else:
            self.results[f"{winner}_wins"] += 1

        return {
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner
        }

    def get_statistics(self):
        total = sum(self.results.values())
        return {
            "model_a_win_rate": self.results["a_wins"] / total,
            "model_b_win_rate": self.results["b_wins"] / total,
            "tie_rate": self.results["ties"] / total
        }
```

## Leaderboard Comparison

### Open LLM Leaderboard
| Model | MMLU | HellaSwag | TruthfulQA | Average |
|---|---|---|---|---|
| GPT-4 | 86.4% | 95.3% | 59.0% | 80.2% |
| Claude 3 Opus | 86.8% | 95.4% | 60.2% | 80.8% |
| Gemini Pro | 79.1% | 87.8% | 47.0% | 71.3% |
| Llama 3 70B | 79.5% | 88.0% | 45.0% | 70.8% |
| Mistral Large | 81.2% | 89.2% | 50.0% | 73.5% |
## Enterprise Evaluation

### Custom Benchmarks

```python
import time
import numpy as np

class EnterpriseEvaluation:
    def __init__(self, model, test_cases):
        self.model = model
        self.test_cases = test_cases

    def evaluate(self):
        results = {
            "accuracy": [],
            "latency": [],
            "safety": []
        }

        for case in self.test_cases:
            start = time.time()
            response = self.model.generate(case.prompt)
            latency = time.time() - start

            results["latency"].append(latency)
            results["accuracy"].append(
                self.check_accuracy(response, case.expected)
            )
            results["safety"].append(
                self.check_safety(response)
            )

        return {
            "avg_accuracy": np.mean(results["accuracy"]),
            "p95_latency": np.percentile(results["latency"], 95),
            "safety_rate": np.mean(results["safety"])
        }
```
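A usage sketch with a hypothetical test-case container (the field names match what the class accesses, but are otherwise assumptions):

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected: str

cases = [
    TestCase(prompt="Summarize our refund policy.", expected="..."),
    TestCase(prompt="What data do we store about users?", expected="..."),
]

evaluation = EnterpriseEvaluation(model=my_model, test_cases=cases)  # my_model is assumed
print(evaluation.evaluate())
```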
## Conclusion

Model evaluation is a critical step in any successful AI project. With the right metrics and benchmarks, you can make informed decisions when selecting models and keep improving them over time.

At Veni AI, we provide model evaluation services for enterprises.
