# AI Model Evaluation: A Guide to Metrics and Benchmarks

A thorough evaluation is essential for choosing the right model. This guide covers the metrics and benchmarks used to evaluate AI models.
## Basic Metrics

### Perplexity

Measures how well a language model predicts text. It is the exponential of the average per-token cross-entropy loss, so lower is better:
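```python
import math

import torch


def calculate_perplexity(model, tokenizer, text):
    """Return the perplexity of a causal language model on `text`."""
    encodings = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing the input IDs as labels makes the model return the mean
        # cross-entropy loss over the predicted tokens.
        outputs = model(**encodings, labels=encodings["input_ids"])
        loss = outputs.loss

    # Perplexity is the exponential of the mean cross-entropy loss.
    return math.exp(loss.item())


# Low perplexity = better model
# Rough guide: 5-20 (good), >100 (bad); values depend on the tokenizer and dataset
```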
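To make the relationship between token probabilities and perplexity concrete, here is a minimal sketch that needs no model. The function name `perplexity_from_logprobs` is an illustrative assumption, and the input is assumed to be a list of natural-log probabilities, one per token.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to each of four tokens is, on average,
# as uncertain as a uniform choice among four options.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity_from_logprobs(uniform_logprobs))  # approximately 4
```

Read this way, a perplexity of about 4 means the model is roughly as uncertain as if it were choosing uniformly among four tokens at each step.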
### Accuracy

The fraction of predictions that are correct:
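```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match their labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```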
### F1 Score

The harmonic mean of precision and recall, which balances the two:
```python
from sklearn.metrics import f1_score, precision_score, recall_score


def calculate_metrics(predictions, labels):
    # average="weighted" averages the per-class scores, weighted by class support.
    return {
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted"),
    }
```

## LLM Benchmarks

### MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subjects:

```python
def evaluate_mmlu(model, dataset):
    # format_mcq_prompt and extract_answer are helpers you must supply.
    results = {}

    for subject in dataset.subjects:
        correct = 0
        total = 0

        for question in dataset.get_questions(subject):
            prompt = format_mcq_prompt(question)
            response = model.generate(prompt)
            predicted = extract_answer(response)

            if predicted == question.correct_answer:
                correct += 1
            total += 1

        results[subject] = correct / total

    # The average is unweighted across subjects (macro average).
    return {
        "subjects": results,
        "average": sum(results.values()) / len(results),
    }
```
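The `evaluate_mmlu` loop relies on `format_mcq_prompt` and `extract_answer`, which the guide does not define. The sketch below shows one plausible shape for them. The `MCQuestion` record and its field names (`text`, `choices`, `correct_answer`) are assumptions made for illustration, and the answer parser is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQuestion:
    # Hypothetical question record; the field names are assumptions.
    text: str
    choices: list[str]
    correct_answer: str  # one of "A", "B", "C", "D"


def format_mcq_prompt(question):
    """Render the question and its lettered options, ending with an answer cue."""
    letters = "ABCD"
    lines = [question.text]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(question.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_answer(response):
    """Return the first standalone capital letter A-D in the response, or None.

    Naive on purpose: a response starting with the article "A" would be
    misread as answer A, so real harnesses constrain the output more tightly.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None
```

Returning `None` for an unparseable response means it is simply scored as incorrect by the comparison in `evaluate_mmlu`.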
MMLU results (2024):

| Model | Score |
|---|---|
| GPT | 86.4% |
| Claude 3 Opus | 86.8% |
| Gemini Ultra | 83.7% |
| Llama 3 70B | 79.5% |
### HellaSwag

Commonsense reasoning, tested by choosing the most plausible continuation of a scenario:
```
Context: "A woman is outside with a bucket and a dog.
The dog is running around trying to avoid a bath. She..."

Options:
A) rinses the dog off with a hose (correct)
B) calls the dog and feeds it
C) throws the bucket at the dog
D) walks into the house
```
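One common way to score this kind of item is to ask the model for the log-likelihood of each ending given the context and pick the highest, often after normalizing by length so that long endings are not penalized. The sketch below assumes a hypothetical `score_fn(context, ending)` callable that returns a total log-likelihood; it uses whitespace token counts as a stand-in for the model's own tokenization.

```python
def pick_ending(context, endings, score_fn):
    """Return the index of the ending with the highest length-normalized score.

    score_fn(context, ending) is a hypothetical callable returning the total
    log-likelihood of `ending` given `context`.
    """
    def normalized(ending):
        n_tokens = max(len(ending.split()), 1)  # whitespace tokens as a proxy
        return score_fn(context, ending) / n_tokens

    return max(range(len(endings)), key=lambda i: normalized(endings[i]))


def hellaswag_accuracy(examples, score_fn):
    """examples: iterable of (context, endings, correct_index) tuples."""
    examples = list(examples)
    hits = sum(
        pick_ending(context, endings, score_fn) == gold
        for context, endings, gold in examples
    )
    return hits / len(examples)
```

With a real model, `score_fn` would sum the token log-probabilities of the ending conditioned on the context.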
### TruthfulQA

Measures hallucination and truthfulness:
```python
def evaluate_truthfulness(model, questions):
    # check_truthfulness and check_informativeness are placeholders for a
    # human rating or a trained classifier.
    truthful_count = 0
    informative_count = 0

    for q in questions:
        response = model.generate(q.question)

        # Human evaluation or classifier
        is_truthful = check_truthfulness(response, q.ground_truth)
        is_informative = check_informativeness(response)

        if is_truthful:
            truthful_count += 1
        if is_informative:
            informative_count += 1

    return {
        "truthful": truthful_count / len(questions),
        "informative": informative_count / len(questions),
    }
```
### HumanEval

Code generation ability:
```python
def evaluate_humaneval(model, problems):
    # run_tests is a placeholder; generated code should run in a sandbox.
    pass_at_1 = 0
    pass_at_10 = 0

    for problem in problems:
        solutions = [model.generate_code(problem.prompt) for _ in range(10)]

        passed = [run_tests(sol, problem.tests) for sol in solutions]

        # pass@1: the first sample passes; pass@10: any of the 10 samples passes.
        if passed[0]:
            pass_at_1 += 1
        if any(passed):
            pass_at_10 += 1

    return {
        "pass@1": pass_at_1 / len(problems),
        "pass@10": pass_at_10 / len(problems),
    }
```
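The loop above estimates pass@1 from a single sample per problem, which can be noisy. The paper that introduced HumanEval instead draws n samples per problem, counts the c that pass, and uses the unbiased estimator 1 - C(n-c, k) / C(n, k). A minimal sketch of that formula follows; the function name `pass_at_k` is an illustrative choice.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate given n samples of which c pass (requires k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.3, the fraction of passing samples. The per-problem estimates are then averaged over all problems.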
### MT-Bench

Multi-turn conversation quality:
```python
def mt_bench_evaluate(model, conversations):
    # gpt4_judge is a placeholder for a call to the judge model.
    scores = []

    for conv in conversations:
        # Multi-turn dialog
        responses = []
        for turn in conv.turns:
            response = model.generate(turn.prompt, history=responses)
            responses.append(response)

        # GPT judge scoring (1-10)
        score = gpt4_judge(conv.turns, responses)
        scores.append(score)

    return sum(scores) / len(scores)
```

## RAG Evaluation

### Retrieval Metrics

```python
def retrieval_metrics(retrieved_docs, relevant_docs, k=10):
    retrieved_k = retrieved_docs[:k]
    relevant_set = set(relevant_docs)

    # Recall@K: share of the relevant documents found in the top k
    retrieved_relevant = len(set(retrieved_k) & relevant_set)
    recall_k = retrieved_relevant / len(relevant_set)

    # Precision@K: share of the top k that is relevant
    precision_k = retrieved_relevant / k

    # Reciprocal rank of the first relevant document for this query;
    # averaging it over many queries gives MRR (Mean Reciprocal Rank).
    mrr = 0
    for i, doc in enumerate(retrieved_k):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "recall@k": recall_k,
        "precision@k": precision_k,
        "mrr": mrr,
    }
```
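`retrieval_metrics` scores a single query, so the reciprocal rank it returns becomes a true MRR only once it is averaged over a set of queries. The following self-contained sketch shows that averaging step; the helper names and the document IDs are made up for illustration.

```python
def reciprocal_rank(retrieved_docs, relevant_docs, k=10):
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    relevant_set = set(relevant_docs)
    for rank, doc in enumerate(retrieved_docs[:k], start=1):
        if doc in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, k=10):
    """runs: list of (retrieved_docs, relevant_docs) pairs, one per query."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(reciprocal_rank(ret, rel, k) for ret, rel in runs) / len(runs)


runs = [
    (["d3", "d1", "d7"], ["d1"]),  # first relevant hit at rank 2
    (["d2", "d9"], ["d2", "d4"]),  # first relevant hit at rank 1
    (["d5", "d6"], ["d8"]),        # no relevant document retrieved
]
print(mean_reciprocal_rank(runs))  # (1/2 + 1 + 0) / 3 = 0.5
```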
### RAGAS Metrics
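```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness


def evaluate_rag(questions, answers, contexts, ground_truths):
    # ragas expects a Hugging Face Dataset rather than a plain dict;
    # `contexts` holds a list of retrieved passages per question.
    # Column names follow the ragas 0.1-style schema and may differ in newer releases.
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    return results
```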
## Text Generation Metrics

### BLEU Score
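```python
from nltk.translate.bleu_score import sentence_bleu


def calculate_bleu(reference, candidate):
    # sentence_bleu takes a list of tokenized references and one tokenized candidate.
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()

    # Default weights use 1- to 4-grams, so very short sentences can score near 0.
    return sentence_bleu(reference_tokens, candidate_tokens)
```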
### ROUGE Score
```python
from rouge_score import rouge_scorer


def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    scores = scorer.score(reference, candidate)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure,
    }
```
### BERTScore

Semantic similarity:
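```python
from bert_score import score


def calculate_bertscore(references, candidates, lang="en"):
    # `lang` must match the language of the texts being compared
    # (the original example hard-coded "tr" for Turkish).
    P, R, F1 = score(candidates, references, lang=lang)
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
    }
```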
## LLM-as-Judge

Evaluation with GPT:
```python
import json

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def llm_judge(response, criteria):
    prompt = f"""Evaluate the following response.

Response: {response}

Evaluation criteria:
{criteria}

Rate from 1-10 and explain your reasoning.
JSON format: {{"score": X, "reasoning": "..."}}
"""

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )

    return json.loads(result.choices[0].message.content)
```
## A/B Testing Framework
```python
class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.results = {"a_wins": 0, "b_wins": 0, "ties": 0}

    def judge_comparison(self, prompt, response_a, response_b):
        # Placeholder (assumption): return "a", "b", or "tie", e.g. from an LLM judge.
        raise NotImplementedError

    def compare(self, prompt):
        response_a = self.model_a.generate(prompt)
        response_b = self.model_b.generate(prompt)

        # Blind comparison with LLM judge
        winner = self.judge_comparison(prompt, response_a, response_b)

        # Ties are stored under "ties"; f"{winner}_wins" would raise a KeyError for them.
        key = "ties" if winner == "tie" else f"{winner}_wins"
        self.results[key] += 1

        return {
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner,
        }

    def get_statistics(self):
        total = sum(self.results.values())
        if total == 0:
            raise ValueError("no comparisons recorded yet")
        return {
            "model_a_win_rate": self.results["a_wins"] / total,
            "model_b_win_rate": self.results["b_wins"] / total,
            "tie_rate": self.results["ties"] / total,
        }
```

## Leaderboard Comparison

### Open LLM Leaderboard
| Model | MMLU | HellaSwag | TruthfulQA | Average |
|---|---|---|---|---|
| GPT | 86.4% | 95.3% | 59.0% | 80.2% |
| Claude 3 Opus | 86.8% | 95.4% | 60.2% | 80.8% |
| Gemini Pro | 79.1% | 87.8% | 47.0% | 71.3% |
| Llama 3 70B | 79.5% | 88.0% | 45.0% | 70.8% |
| Mistral Large | 81.2% | 89.2% | 50.0% | 73.5% |
## Enterprise Evaluation

### Custom Benchmark

```python
import time

import numpy as np


class EnterpriseEvaluation:
    def __init__(self, model, test_cases):
        self.model = model
        self.test_cases = test_cases

    def evaluate(self):
        # check_accuracy and check_safety are placeholders to implement per use case.
        # "cost" is reserved but not populated in this example.
        results = {
            "accuracy": [],
            "latency": [],
            "cost": [],
            "safety": [],
        }

        for case in self.test_cases:
            # perf_counter is a monotonic clock suited to measuring durations.
            start = time.perf_counter()
            response = self.model.generate(case.prompt)
            latency = time.perf_counter() - start

            results["latency"].append(latency)
            results["accuracy"].append(
                self.check_accuracy(response, case.expected)
            )
            results["safety"].append(
                self.check_safety(response)
            )

        return {
            "avg_accuracy": np.mean(results["accuracy"]),
            "p95_latency": np.percentile(results["latency"], 95),
            "safety_rate": np.mean(results["safety"]),
        }
```
## Conclusion

Model evaluation is an essential step in the success of AI projects. With the right metrics and benchmarks, you can make informed model choices and support continuous improvement.

At Veni AI, we offer model evaluation services for enterprises.
