AI Model Evaluation: A Guide to Metrics and Benchmarks
Choosing the right model depends on thorough evaluation. This guide covers the metrics and benchmarks used to evaluate AI models.
Core Metrics
Perplexity
Measures how well a language model predicts text:
```python
import math
import torch

def calculate_perplexity(model, tokenizer, text):
    encodings = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
        loss = outputs.loss

    # Perplexity is the exponential of the average cross-entropy loss
    perplexity = math.exp(loss.item())
    return perplexity

# Lower perplexity = better model
# Typical values: 5-20 (good), >100 (poor)
```
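A quick usage sketch with a Hugging Face model; the `gpt2` checkpoint here is just an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works; gpt2 is used purely for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ppl = calculate_perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog.")
print(f"Perplexity: {ppl:.2f}")
```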
Accuracy
The proportion of correct predictions:
```python
def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```
F1 Score
The balance between precision and recall:
```python
from sklearn.metrics import f1_score, precision_score, recall_score

def calculate_metrics(predictions, labels):
    return {
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted")
    }
```
LLM Benchmarks
MMLU (Massive Multitask Language Understanding)
Multiple-choice questions across 57 subject areas:
```python
def evaluate_mmlu(model, dataset):
    results = {}

    for subject in dataset.subjects:
        correct = 0
        total = 0

        for question in dataset.get_questions(subject):
            prompt = format_mcq_prompt(question)
            response = model.generate(prompt)
            predicted = extract_answer(response)

            if predicted == question.correct_answer:
                correct += 1
            total += 1

        results[subject] = correct / total

    return {
        "subjects": results,
        "average": sum(results.values()) / len(results)
    }
```
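`format_mcq_prompt` and `extract_answer` are left undefined above; here is a minimal sketch of what they might look like, assuming each question carries hypothetical `text` and `choices` fields:

```python
def format_mcq_prompt(question):
    # Hypothetical fields: question.text and question.choices (4 options)
    options = "\n".join(
        f"{letter}) {choice}"
        for letter, choice in zip("ABCD", question.choices)
    )
    return f"{question.text}\n{options}\nAnswer:"

def extract_answer(response):
    # Return the first A/B/C/D letter found in the model output
    for char in response.upper():
        if char in "ABCD":
            return char
    return None
```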
MMLU results (2024):
| Model | Score |
|---|---|
| GPT-4 | 86.4% |
| Claude 3 Opus | 86.8% |
| Gemini Ultra | 83.7% |
| Llama 3 70B | 79.5% |
HellaSwag
Commonsense reasoning, tested as sentence completion:
```
Context: "A woman is outside with a bucket and a dog.
The dog is running around trying to avoid a bath. She..."

Options:
A) rinses the dog off with a hose (correct)
B) calls the dog and feeds it
C) throws the bucket at the dog
D) walks into the house
```
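HellaSwag is typically scored by likelihood rather than free-form generation: the model picks the ending to which it assigns the highest probability. A minimal sketch, assuming a Hugging Face causal LM (the token alignment is approximate, since BPE boundaries can shift when context and ending are tokenized together):

```python
import torch

def score_ending(model, tokenizer, context, ending):
    # Average per-token log-likelihood of the ending given the context
    full = tokenizer(context + " " + ending, return_tensors="pt")
    ctx_len = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]

    with torch.no_grad():
        logits = model(**full).logits

    # Logits at position i predict token i + 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_ids = full["input_ids"][0, 1:]
    token_lls = log_probs[torch.arange(target_ids.shape[0]), target_ids]

    # Only the ending tokens contribute to the score
    return token_lls[ctx_len - 1:].mean().item()

def predict_hellaswag(model, tokenizer, context, endings):
    scores = [score_ending(model, tokenizer, context, e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])
```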
TruthfulQA
Measures hallucination and factual accuracy:
```python
def evaluate_truthfulness(model, questions):
    truthful_count = 0
    informative_count = 0

    for q in questions:
        response = model.generate(q.question)

        # Judged by humans or a trained classifier
        is_truthful = check_truthfulness(response, q.ground_truth)
        is_informative = check_informativeness(response)

        if is_truthful:
            truthful_count += 1
        if is_informative:
            informative_count += 1

    return {
        "truthful": truthful_count / len(questions),
        "informative": informative_count / len(questions)
    }
```
HumanEval
Code generation ability:
```python
def evaluate_humaneval(model, problems):
    pass_at_1 = 0
    pass_at_10 = 0

    for problem in problems:
        solutions = [model.generate_code(problem.prompt) for _ in range(10)]

        passed = [run_tests(sol, problem.tests) for sol in solutions]

        if passed[0]:
            pass_at_1 += 1
        if any(passed):
            pass_at_10 += 1

    return {
        "pass@1": pass_at_1 / len(problems),
        "pass@10": pass_at_10 / len(problems)
    }
```
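The loop above is a naive empirical estimate. The HumanEval paper (Chen et al., 2021) instead uses an unbiased estimator: draw n samples per problem, count the c that pass, and compute pass@k = 1 - C(n-c, k) / C(n, k):

```python
import math

def pass_at_k(n, c, k):
    # Probability that at least one of k samples drawn from
    # n candidates (of which c pass) passes the unit tests
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```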
MT-Bench
Multi-turn conversation quality:
```python
def mt_bench_evaluate(model, conversations):
    scores = []

    for conv in conversations:
        # Multi-turn dialog: each turn sees the earlier responses
        responses = []
        for turn in conv.turns:
            response = model.generate(turn.prompt, history=responses)
            responses.append(response)

        # GPT-4 judge scores the conversation from 1 to 10
        score = gpt4_judge(conv.turns, responses)
        scores.append(score)

    return sum(scores) / len(scores)
```
RAG Evaluation
Retrieval Metrics
```python
def retrieval_metrics(retrieved_docs, relevant_docs, k=10):
    retrieved_k = retrieved_docs[:k]
    relevant_set = set(relevant_docs)

    # Recall@K
    retrieved_relevant = len(set(retrieved_k) & relevant_set)
    recall_k = retrieved_relevant / len(relevant_set)

    # Precision@K
    precision_k = retrieved_relevant / k

    # MRR (Mean Reciprocal Rank)
    mrr = 0
    for i, doc in enumerate(retrieved_k):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "recall@k": recall_k,
        "precision@k": precision_k,
        "mrr": mrr
    }
```
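A worked example: with one of two relevant documents retrieved in the top 3 and the first relevant hit at rank 2, Recall@3 = 0.5, Precision@3 = 1/3, and the reciprocal rank is 0.5:

```python
metrics = retrieval_metrics(
    retrieved_docs=["d7", "d1", "d9"],
    relevant_docs=["d1", "d4"],
    k=3
)
# {"recall@k": 0.5, "precision@k": 0.333..., "mrr": 0.5}
```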
RAGAS Metrics
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def evaluate_rag(questions, answers, contexts, ground_truths):
    # ragas expects a Hugging Face Dataset
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )

    return results
```
Text Generation Metrics
BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu(reference, candidate):
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()

    return sentence_bleu(reference_tokens, candidate_tokens)
```
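On short texts, sentence_bleu often returns scores near zero because higher-order n-grams rarely match; NLTK's SmoothingFunction mitigates this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu_smoothed(reference, candidate):
    # method1 adds a small constant to zero n-gram counts
    smoothing = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()],
        candidate.split(),
        smoothing_function=smoothing
    )
```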
ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    scores = scorer.score(reference, candidate)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
```
BERTScore
Semantic similarity:
```python
from bert_score import score

def calculate_bertscore(references, candidates):
    P, R, F1 = score(candidates, references, lang="tr")
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
```
LLM-as-Judge
Evaluation with a GPT model as the judge:
```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge(response, criteria):
    prompt = f"""Evaluate the response below.

Response: {response}

Evaluation criteria:
{criteria}

Give a score from 1 to 10 and explain your reasoning.
JSON format: {{"score": X, "reasoning": "..."}}
"""

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(result.choices[0].message.content)
```
A/B Testing Framework
```python
class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.results = {"a_wins": 0, "b_wins": 0, "ties": 0}

    def compare(self, prompt):
        response_a = self.model_a.generate(prompt)
        response_b = self.model_b.generate(prompt)

        # Blind comparison with an LLM judge: "a", "b", or "tie"
        winner = self.judge_comparison(prompt, response_a, response_b)

        if winner == "tie":
            self.results["ties"] += 1
        else:
            self.results[f"{winner}_wins"] += 1

        return {
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner
        }

    def get_statistics(self):
        total = sum(self.results.values())
        return {
            "model_a_win_rate": self.results["a_wins"] / total,
            "model_b_win_rate": self.results["b_wins"] / total,
            "tie_rate": self.results["ties"] / total
        }
```
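`judge_comparison` is left undefined above; a minimal sketch reusing the llm_judge pattern from the previous section (the prompt wording is an assumption, and production setups typically also swap the order of the two responses to counter position bias):

```python
import json

def judge_comparison(self, prompt, response_a, response_b):
    # Method for ModelABTest; reuses the OpenAI client defined earlier
    judge_prompt = f"""Compare the two responses to the prompt below.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? Reply in JSON: {{"winner": "a" | "b" | "tie"}}
"""
    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return json.loads(result.choices[0].message.content)["winner"]

ModelABTest.judge_comparison = judge_comparison
```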
Leaderboard Comparison
Open LLM Leaderboard
| Model | MMLU | HellaSwag | TruthfulQA | Average |
|---|---|---|---|---|
| GPT-4 | 86.4% | 95.3% | 59.0% | 80.2% |
| Claude 3 Opus | 86.8% | 95.4% | 60.2% | 80.8% |
| Gemini Pro | 79.1% | 87.8% | 47.0% | 71.3% |
| Llama 3 70B | 79.5% | 88.0% | 45.0% | 70.8% |
| Mistral Large | 81.2% | 89.2% | 50.0% | 73.5% |
Enterprise Evaluation
Custom Benchmark
```python
import time
import numpy as np

class EnterpriseEvaluation:
    def __init__(self, model, test_cases):
        self.model = model
        self.test_cases = test_cases

    def evaluate(self):
        results = {
            "accuracy": [],
            "latency": [],
            "safety": []
        }

        for case in self.test_cases:
            start = time.time()
            response = self.model.generate(case.prompt)
            latency = time.time() - start

            results["latency"].append(latency)
            results["accuracy"].append(
                self.check_accuracy(response, case.expected)
            )
            results["safety"].append(
                self.check_safety(response)
            )

        return {
            "avg_accuracy": np.mean(results["accuracy"]),
            "p95_latency": np.percentile(results["latency"], 95),
            "safety_rate": np.mean(results["safety"])
        }
```
Conclusion
Model evaluation is a critical step in the success of any AI project. With the right metrics and benchmarks, you can make informed model choices and drive continuous improvement.
At Veni AI, we offer enterprise model evaluation services.
