
# AI Model Evaluation: A Guide to Metrics and Benchmarks

A complete guide to AI and LLM evaluation metrics, benchmark datasets, MMLU, HellaSwag, perplexity, and model selection for enterprises.

Veni AI Technical Team · December 31, 2024 · 5 min read

Thorough evaluation is critical for choosing the right model. In this guide we look at the metrics and benchmarks used to evaluate AI models.

## Core Metrics

### Perplexity

Measures how well a language model predicts text:

```python
import torch
import math

def calculate_perplexity(model, tokenizer, text):
    encodings = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
        loss = outputs.loss

    perplexity = math.exp(loss.item())
    return perplexity

# Low perplexity = better model
# Typical values: 5-20 (good), >100 (bad)
```
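A minimal usage sketch, assuming a Hugging Face causal language model such as `gpt2` (the model name and the sample text are illustrative, not part of the original snippet):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; gpt2 is just a small, easy-to-download example
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

ppl = calculate_perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog.")
print(f"Perplexity: {ppl:.2f}")
```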

### Accuracy

The fraction of correct predictions:

```python
def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```

### F1 Score

The balance between precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def calculate_metrics(predictions, labels):
    return {
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted")
    }
```

## LLM Benchmarks

### MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subject areas:

```python
def evaluate_mmlu(model, dataset):
    results = {}

    for subject in dataset.subjects:
        correct = 0
        total = 0

        for question in dataset.get_questions(subject):
            prompt = format_mcq_prompt(question)
            response = model.generate(prompt)
            predicted = extract_answer(response)

            if predicted == question.correct_answer:
                correct += 1
            total += 1

        results[subject] = correct / total

    return {
        "subjects": results,
        "average": sum(results.values()) / len(results)
    }
```
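The snippet above assumes two helpers, `format_mcq_prompt` and `extract_answer`. One possible sketch of what they might look like (the question fields and the regex are assumptions, not part of the official MMLU harness):

```python
import re

def format_mcq_prompt(question):
    # Assumes the question object exposes .text and .choices (four options, A-D)
    options = "\n".join(
        f"{letter}) {choice}"
        for letter, choice in zip("ABCD", question.choices)
    )
    return (
        f"Question: {question.text}\n"
        f"{options}\n"
        "Answer with a single letter (A, B, C or D):"
    )

def extract_answer(response):
    # Take the first standalone A/B/C/D in the model's reply
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else None
```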

MMLU results (2024):

| Model | Score |
| --- | --- |
| GPT-4 | 86.4% |
| Claude 3 Opus | 86.8% |
| Gemini Ultra | 83.7% |
| Llama 3 70B | 79.5% |

### HellaSwag

Common-sense reasoning:

```
Context: "A woman is outside with a bucket and a dog.
The dog is running around trying to avoid a bath. She..."

Options:
A) rinses the dog off with a hose (correct)
B) calls the dog and feeds it
C) throws the bucket at the dog
D) walks into the house
```
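Scoring is typically done by asking the model which ending it finds most likely. A simplified sketch that reuses the `calculate_perplexity` function from above (the official benchmark scores the length-normalized log-likelihood of the ending tokens only, so treat this as an approximation):

```python
def evaluate_hellaswag_example(model, tokenizer, context, options, correct_idx):
    # Lower perplexity on context + ending = the model finds that ending more natural
    ppls = [
        calculate_perplexity(model, tokenizer, f"{context} {option}")
        for option in options
    ]
    predicted = ppls.index(min(ppls))
    return predicted == correct_idx
```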

### TruthfulQA

Measures hallucinations and truthfulness:

```python
def evaluate_truthfulness(model, questions):
    truthful_count = 0
    informative_count = 0

    for q in questions:
        response = model.generate(q.question)

        # Human evaluation or classifier
        is_truthful = check_truthfulness(response, q.ground_truth)
        is_informative = check_informativeness(response)

        if is_truthful:
            truthful_count += 1
        if is_informative:
            informative_count += 1

    return {
        "truthful": truthful_count / len(questions),
        "informative": informative_count / len(questions)
    }
```
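`check_truthfulness` and `check_informativeness` are left undefined above; the official TruthfulQA setup uses fine-tuned judge models. One possible stand-in is an LLM judge that compares the answer against the ground truth (the prompt wording and the OpenAI client setup are assumptions, not the benchmark's method):

```python
from openai import OpenAI

judge_client = OpenAI()

def check_truthfulness(response, ground_truth):
    # Ask a judge model whether the response contradicts the reference answer
    judge_prompt = (
        "Reference answer:\n"
        f"{ground_truth}\n\n"
        "Model answer:\n"
        f"{response}\n\n"
        "Does the model answer contradict the reference? Reply YES or NO."
    )
    result = judge_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return "NO" in result.choices[0].message.content.upper()
```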

### HumanEval

Code generation ability:

```python
def evaluate_humaneval(model, problems):
    pass_at_1 = 0
    pass_at_10 = 0

    for problem in problems:
        solutions = [model.generate_code(problem.prompt) for _ in range(10)]

        passed = [run_tests(sol, problem.tests) for sol in solutions]

        if passed[0]:
            pass_at_1 += 1
        if any(passed):
            pass_at_10 += 1

    return {
        "pass@1": pass_at_1 / len(problems),
        "pass@10": pass_at_10 / len(problems)
    }
```
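In practice, pass@k is usually computed with the unbiased estimator from the original HumanEval paper rather than by simply checking the first k generations. A short sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n generated samples, c of which passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 2 passed
# pass_at_k(10, 2, 1) = 0.20, pass_at_k(10, 2, 10) = 1.0
```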

### MT-Bench

Multi-turn conversation quality:

```python
def mt_bench_evaluate(model, conversations):
    scores = []

    for conv in conversations:
        # Multi-turn dialogue
        responses = []
        for turn in conv.turns:
            response = model.generate(turn.prompt, history=responses)
            responses.append(response)

        # GPT-4 judge scoring (1-10)
        score = gpt4_judge(conv.turns, responses)
        scores.append(score)

    return sum(scores) / len(scores)
```

## RAG Evaluation

### Retrieval Metrics

```python
def retrieval_metrics(retrieved_docs, relevant_docs, k=10):
    retrieved_k = retrieved_docs[:k]
    relevant_set = set(relevant_docs)

    # Recall@K
    retrieved_relevant = len(set(retrieved_k) & relevant_set)
    recall_k = retrieved_relevant / len(relevant_set)

    # Precision@K
    precision_k = retrieved_relevant / k

    # MRR (Mean Reciprocal Rank)
    mrr = 0
    for i, doc in enumerate(retrieved_k):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "recall@k": recall_k,
        "precision@k": precision_k,
        "mrr": mrr
    }
```
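A small usage example with made-up document IDs:

```python
retrieved = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant = ["doc1", "doc2", "doc3"]

print(retrieval_metrics(retrieved, relevant, k=5))
# {'recall@k': 0.667, 'precision@k': 0.4, 'mrr': 0.5}  (recall rounded)
```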

### RAGAS Metrics

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def evaluate_rag(questions, answers, contexts, ground_truths):
    # ragas expects a Hugging Face Dataset with these column names
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )

    return results
```

## Text Generation Metrics

### BLEU Score

```python
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu(reference, candidate):
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()

    return sentence_bleu(reference_tokens, candidate_tokens)
```
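For short sentences BLEU often collapses to zero because the higher-order n-grams never match; NLTK's `SmoothingFunction` mitigates this. A small variant:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu_smoothed(reference, candidate):
    # method1 adds a tiny count to zero n-gram matches to avoid zero scores
    smoothie = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()],
        candidate.split(),
        smoothing_function=smoothie
    )
```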

### ROUGE Score

```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    scores = scorer.score(reference, candidate)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
```

### BERTScore

Semantic similarity:

```python
from bert_score import score

def calculate_bertscore(references, candidates):
    # lang selects the underlying pretrained model for the texts being compared
    P, R, F1 = score(candidates, references, lang="en")
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
```

### LLM-as-Judge

Evaluation with GPT as the judge:

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge(response, criteria):
    prompt = f"""Evaluate the following response.

Response: {response}

Evaluation criteria:
{criteria}

Rate from 1-10 and explain your reasoning.
JSON format: {{"score": X, "reasoning": "..."}}
"""

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(result.choices[0].message.content)
```
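An example call (the response text and criteria are illustrative):

```python
verdict = llm_judge(
    response="Paris is the capital of France and has about 2.1 million residents.",
    criteria="Factual accuracy, completeness, clarity"
)
print(verdict["score"], verdict["reasoning"])
```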

## A/B Testing Framework

```python
class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.results = {"a_wins": 0, "b_wins": 0, "ties": 0}

    def compare(self, prompt):
        response_a = self.model_a.generate(prompt)
        response_b = self.model_b.generate(prompt)

        # Blind comparison with an LLM judge; expected to return "a", "b" or "tie"
        winner = self.judge_comparison(prompt, response_a, response_b)

        if winner == "tie":
            self.results["ties"] += 1
        else:
            self.results[f"{winner}_wins"] += 1

        return {
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner
        }

    def get_statistics(self):
        total = sum(self.results.values())
        return {
            "model_a_win_rate": self.results["a_wins"] / total,
            "model_b_win_rate": self.results["b_wins"] / total,
            "tie_rate": self.results["ties"] / total
        }
```

## Leaderboard Comparison

### Open LLM Leaderboard
| Model | MMLU | HellaSwag | TruthfulQA | Average |
| --- | --- | --- | --- | --- |
| GPT-4 | 86.4% | 95.3% | 59.0% | 80.2% |
| Claude 3 Opus | 86.8% | 95.4% | 60.2% | 80.8% |
| Gemini Pro | 79.1% | 87.8% | 47.0% | 71.3% |
| Llama 3 70B | 79.5% | 88.0% | 45.0% | 70.8% |
| Mistral Large | 81.2% | 89.2% | 50.0% | 73.5% |
## Enterprise Evaluation

### Custom Benchmark

```python
import time

import numpy as np

class EnterpriseEvaluation:
    def __init__(self, model, test_cases):
        self.model = model
        self.test_cases = test_cases

    def evaluate(self):
        results = {
            "accuracy": [],
            "latency": [],
            "cost": [],
            "safety": []
        }

        for case in self.test_cases:
            start = time.time()
            response = self.model.generate(case.prompt)
            latency = time.time() - start

            results["latency"].append(latency)
            results["accuracy"].append(
                self.check_accuracy(response, case.expected)
            )
            results["safety"].append(
                self.check_safety(response)
            )
            # per-call cost could be appended to results["cost"] here as well

        return {
            "avg_accuracy": np.mean(results["accuracy"]),
            "p95_latency": np.percentile(results["latency"], 95),
            "safety_rate": np.mean(results["safety"])
        }
```
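`check_accuracy` and `check_safety` are domain-specific and left undefined above. As one possible sketch, `check_safety` could delegate to a moderation endpoint (using the OpenAI moderation API here is an assumption about your stack, not a requirement):

```python
from openai import OpenAI

moderation_client = OpenAI()

def check_safety(response_text):
    # Returns True when the moderation endpoint does not flag the output
    result = moderation_client.moderations.create(input=response_text)
    return not result.results[0].flagged
```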

## Conclusion

Model evaluation is a critical step in the success of AI projects. With the right metrics and benchmarks, you can make informed model-selection decisions and ensure continuous improvement.

At Veni AI, we offer enterprise model evaluation services.
