AI Model Evaluation: Metrics and Benchmark Guide

Choosing the right model depends on thorough evaluation. This guide covers the metrics and benchmarks commonly used to evaluate AI models.

Basic Metrics

Perplexity

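Perplexity measures how well a language model predicts text. It is the exponential of the average per-token negative log-likelihood, so lower is better: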

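import math
import torch

def calculate_perplexity(model, tokenizer, text):
    # Tokenize the text into model inputs (PyTorch tensors)
    encodings = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing the input IDs as labels makes the model return the mean
        # cross-entropy loss over the sequence
        outputs = model(**encodings, labels=encodings["input_ids"])
        loss = outputs.loss

    # Perplexity is the exponential of the mean per-token loss
    perplexity = math.exp(loss.item())
    return perplexity

# Low perplexity = better model
# Rough rule of thumb: 5-20 (good), >100 (bad). Values are only comparable
# between models that share a tokenizer and a test set.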
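To make the definition concrete, perplexity can also be computed directly from per-token log-probabilities, with no model in the loop. The following is a minimal sketch that assumes natural-log probabilities; the function name `perplexity_from_logprobs` is illustrative and not part of any library.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity_from_logprobs(uniform_logprobs))
```

Reading perplexity as an "effective number of equally likely choices per token" is often the easiest way to interpret a reported value.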

Accuracy

The fraction of predictions that match the labels:

def accuracy(predictions, labels):
    # Count the positions where the prediction equals the label
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

F1 Score

The harmonic mean of precision and recall:

from sklearn.metrics import f1_score, precision_score, recall_score

def calculate_metrics(predictions, labels):
    # average="weighted" averages per-class scores weighted by class support
    return {
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted")
    }
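To show what `average="weighted"` does, the sketch below recomputes a support-weighted F1 by hand in plain Python. It is meant as an illustration of the arithmetic, not a replacement for scikit-learn; the helper names `per_class_f1` and `weighted_f1` are made up for this example.

```python
from collections import Counter

def per_class_f1(predictions, labels):
    """Return {class: (precision, recall, f1, support)} computed by hand."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, l in zip(predictions, labels):
        if p == l:
            tp[l] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[l] += 1  # true class l was missed
    stats = {}
    for c in set(labels) | set(predictions):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        stats[c] = (precision, recall, f1, tp[c] + fn[c])
    return stats

def weighted_f1(predictions, labels):
    """Average the per-class F1 scores, weighting each class by its support."""
    stats = per_class_f1(predictions, labels)
    total = sum(support for *_, support in stats.values())
    return sum(f1 * support for _, _, f1, support in stats.values()) / total

# Class 1 has F1 = 0.8 (support 3); class 0 has F1 = 2/3 (support 1).
print(weighted_f1([1, 1, 0, 0], [1, 1, 1, 0]))
```

Because each class is weighted by how often it occurs, a rare class with poor F1 moves the weighted score very little; macro averaging is the usual choice when rare classes matter as much as common ones.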

LLM Benchmarks

MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subject areas:

def evaluate_mmlu(model, dataset):
    # format_mcq_prompt and extract_answer are helpers you supply
    results = {}

    for subject in dataset.subjects:
        correct = 0
        total = 0

        for question in dataset.get_questions(subject):
            prompt = format_mcq_prompt(question)
            response = model.generate(prompt)
            predicted = extract_answer(response)

            if predicted == question.correct_answer:
                correct += 1
            total += 1

        # Per-subject accuracy
        results[subject] = correct / total

    return {
        "subjects": results,
        # Unweighted mean over subjects (each subject counts equally)
        "average": sum(results.values()) / len(results)
    }
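The evaluation loop above relies on two helpers, `format_mcq_prompt` and `extract_answer`, that are not defined in this guide. One hypothetical way to write them is sketched below. Since the structure of the `question` object is not specified, this version takes the question text and the four options directly; the prompt layout and the regular expression are assumptions, not the official MMLU format.

```python
import re

CHOICE_LETTERS = "ABCD"

def format_mcq_prompt(question_text, choices):
    """Render a question and its options as a single prompt string."""
    lines = [question_text]
    for letter, choice in zip(CHOICE_LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def extract_answer(response):
    """Return the first standalone capital A-D in the response, or None.

    This is deliberately naive: a reply starting with the article "A"
    would be misread as choice A.
    """
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None

print(format_mcq_prompt("2+2?", ["3", "4", "5", "6"]))
print(extract_answer("The answer is B."))
```

Answer extraction is a common source of silent scoring errors, so it is worth logging responses for which `extract_answer` returns `None` rather than counting them as wrong without inspection.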

MMLU Results (2024):

Model             | Score
------------------|-------
GPT-4             | 86.4%
Claude 3 Opus     | 86.8%
Gemini Ultra      | 83.7%
Llama 3 70B       | 79.5%

HellaSwag

Commonsense reasoning, tested by picking the most plausible continuation of a short scenario:

Context: "A woman is outside with a bucket and a dog.
The dog is running around trying to avoid a bath. She..."

Options:
A) rinses the dog off with a hose (correct)
B) calls the dog and feeds it
C) throws the bucket at the dog
D) walks into the house
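Benchmarks of this kind are commonly scored by comparing the likelihood the model assigns to each candidate ending rather than by free-form generation, often with a length-normalized variant. The sketch below shows that selection step under stated assumptions: `logprob_fn` is a hypothetical callable that returns the total log-probability of an ending given the context, and token count is approximated by whitespace-separated words.

```python
def pick_continuation(context, options, logprob_fn):
    """Return the index of the option with the highest normalized score.

    logprob_fn(context, option) is an assumed callable returning the total
    log-probability the model assigns to `option` given `context`.
    """
    def normalized(option):
        # Divide by length so longer endings are not penalized merely
        # for containing more tokens.
        n_tokens = max(len(option.split()), 1)
        return logprob_fn(context, option) / n_tokens

    scores = [normalized(option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)

# Toy scorer with made-up log-probabilities, for illustration only.
toy_scores = {"rinses the dog off": -4.0, "walks away": -3.0}
choice = pick_continuation("She...", list(toy_scores),
                           lambda ctx, opt: toy_scores[opt])
print(choice)
```

In the toy example the longer ending has the lower total log-probability but the better per-token score, so normalization changes which option is selected.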

TruthfulQA

Hallucination and truthfulness measurement:

def evaluate_truthfulness(model, questions):
    # check_truthfulness and check_informativeness are helpers you supply
    truthful_count = 0
    informative_count = 0

    for q in questions:
        response = model.generate(q.question)

        # Human evaluation or classifier
        is_truthful = check_truthfulness(response, q.ground_truth)
        is_informative = check_informativeness(response)

        if is_truthful:
            truthful_count += 1
        if is_informative:
            informative_count += 1

    return {
        "truthful": truthful_count / len(questions),
        "informative": informative_count / len(questions)
    }

HumanEval

Code generation capability:

def evaluate_humaneval(model, problems):
    pass_at_1 = 0
    pass_at_10 = 0

    for problem in problems:
        # Sample 10 candidate solutions per problem
        solutions = [model.generate_code(problem.prompt) for _ in range(10)]

        # run_tests returns True if a solution passes the problem's unit tests
        passed = [run_tests(sol, problem.tests) for sol in solutions]

        if passed[0]:
            pass_at_1 += 1
        if any(passed):
            pass_at_10 += 1

    return {
        "pass@1": pass_at_1 / len(problems),
        "pass@10": pass_at_10 / len(problems)
    }
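The loop above estimates pass@1 from the first sample only, which is noisy. The estimator proposed alongside HumanEval instead draws n samples per problem, counts the c that pass, and computes pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that formula for a single problem follows; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 5 of them pass: a single draw passes half the time.
print(pass_at_k(n=10, c=5, k=1))
```

The benchmark-level score is the mean of this value over all problems. Using all n samples for pass@1 gives a lower-variance estimate than checking only `passed[0]`.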

MT-Bench

Multi-turn conversation quality:

def mt_bench_evaluate(model, conversations):
    scores = []

    for conv in conversations:
        # Multi-turn dialog: each turn sees the model's earlier responses
        responses = []
        for turn in conv.turns:
            response = model.generate(turn.prompt, history=responses)
            responses.append(response)

        # GPT-4 judge scoring (1-10); gpt4_judge is a helper you supply
        score = gpt4_judge(conv.turns, responses)
        scores.append(score)

    # Mean judge score across conversations
    return sum(scores) / len(scores)
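In the loop above, `history` contains only the model's earlier responses. Chat models usually expect the full exchange, with user prompts and assistant replies interleaved. The sketch below shows one way to carry that history forward; `chat_fn` is a hypothetical callable that takes a list of role/content messages and returns the reply text.

```python
def run_conversation(chat_fn, user_prompts):
    """Feed prompts one turn at a time, carrying the full history forward.

    chat_fn(messages) is an assumed callable that takes a list of
    {"role", "content"} dicts and returns the assistant's reply as a string.
    """
    messages = []
    replies = []
    for prompt in user_prompts:
        messages.append({"role": "user", "content": prompt})
        reply = chat_fn(messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies, messages

# Stand-in model that reports how many messages it was shown.
replies, messages = run_conversation(
    lambda msgs: f"saw {len(msgs)} message(s)",
    ["First question", "Follow-up question"],
)
print(replies)
```

Returning the full message list alongside the replies makes it straightforward to pass the whole transcript to the judge.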

RAG Evaluation

Retrieval Metrics

def retrieval_metrics(retrieved_docs, relevant_docs, k=10):
    retrieved_k = retrieved_docs[:k]
    relevant_set = set(relevant_docs)

    # Recall@K: share of relevant documents found in the top k
    retrieved_relevant = len(set(retrieved_k) & relevant_set)
    recall_k = retrieved_relevant / len(relevant_set) if relevant_set else 0.0

    # Precision@K: share of the top k that is relevant
    precision_k = retrieved_relevant / k

    # Reciprocal rank of the first relevant document for this query.
    # Averaging this value over all queries gives MRR (Mean Reciprocal Rank).
    mrr = 0
    for i, doc in enumerate(retrieved_k):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "recall@k": recall_k,
        "precision@k": precision_k,
        "mrr": mrr
    }
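The function above handles one query, so its `mrr` value is a single reciprocal rank. The "mean" in Mean Reciprocal Rank comes from averaging over a set of queries, as in this short sketch (the helper names are illustrative):

```python
def reciprocal_rank(retrieved_docs, relevant_docs):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_docs)
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_docs, relevant_docs) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "b"], ["a"]),  # first hit at rank 1 -> 1.0
    (["a", "b"], ["b"]),  # first hit at rank 2 -> 0.5
    (["a", "b"], ["z"]),  # no hit              -> 0.0
]
print(mean_reciprocal_rank(runs))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```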

RAGAS Metrics

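from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def evaluate_rag(questions, answers, contexts, ground_truths):
    # ragas expects a Hugging Face Dataset rather than a plain dict.
    # Column names follow the ragas 0.1.x schema; check the schema of the
    # release you have installed, since newer versions may differ.
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,  # one list of context strings per question
        "ground_truth": ground_truths
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )

    return results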

Text Generation Metrics

BLEU Score

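from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    # sentence_bleu takes a list of tokenized references and one tokenized candidate
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()

    # Smoothing avoids a score of zero when a short sentence has no
    # matching higher-order n-grams
    smoothing = SmoothingFunction().method1
    return sentence_bleu(reference_tokens, candidate_tokens,
                         smoothing_function=smoothing)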
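The core ingredient of BLEU is clipped (modified) n-gram precision: each candidate n-gram is credited at most as many times as it appears in the reference. The sketch below computes only that component for a single reference, using whitespace tokenization; it omits the brevity penalty and the geometric mean over n-gram orders, so it is not a full BLEU implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(reference, candidate, n=1):
    """Share of candidate n-grams matched in the reference, with clipping."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as the reference has it
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total

# "the" occurs 7 times in the candidate but only twice in the reference,
# so only 2 of the 7 unigrams are credited.
print(clipped_ngram_precision("the cat is on the mat",
                              "the the the the the the the"))
```

Without clipping, the degenerate candidate above would score a perfect unigram precision, which is exactly the failure mode this step guards against.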

ROUGE Score

from rouge_score import rouge_scorer

def calculate_rouge(reference, candidate):
    # ROUGE-1/2 use unigram/bigram overlap; ROUGE-L uses the longest common subsequence
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    scores = scorer.score(reference, candidate)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
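ROUGE-L is built on the longest common subsequence (LCS) between reference and candidate. The sketch below computes a ROUGE-L F-measure by hand with whitespace tokenization, as an illustration of the idea. The `rouge_score` package applies its own tokenization, so its values can differ from this simplified version.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """F-measure of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS is "the cat on the mat" (5 tokens): precision 5/5, recall 5/6.
print(rouge_l_f1("the cat sat on the mat", "the cat on the mat"))
```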

BERTScore

Semantic similarity:

from bert_score import score

def calculate_bertscore(references, candidates, lang="en"):
    # lang selects the underlying model; set it to the language of your
    # texts (for example "tr" for Turkish)
    P, R, F1 = score(candidates, references, lang=lang)
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }

LLM-as-Judge

Using GPT-4 Turbo as an automated judge:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(response, criteria):
    prompt = f"""Evaluate the following response.

Response: {response}

Evaluation criteria:
{criteria}

Rate from 1-10 and explain your reasoning.
JSON format: {{"score": X, "reasoning": "..."}}
"""

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}]
    )

    # The judge replies with a JSON object; parse it into a dict
    return json.loads(result.choices[0].message.content)
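Judge output should not be trusted blindly: the reply may be malformed, or the score may fall outside the requested 1-10 range. A small validation step, sketched below, can catch both cases before the scores are aggregated. The function name `parse_judge_output` and the choice to return `None` on failure are assumptions made for this example.

```python
import json

def parse_judge_output(raw, min_score=1, max_score=10):
    """Parse the judge's JSON reply; return None if it is unusable."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Not JSON, not an object, no "score" key, or a non-numeric score
        return None
    if not min_score <= score <= max_score:
        return None
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}

print(parse_judge_output('{"score": 8, "reasoning": "Clear and correct."}'))
print(parse_judge_output('{"score": 11}'))
```

Tracking how often the judge returns unusable output is itself a useful signal about the reliability of the evaluation setup.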

A/B Testing Framework

class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.results = {"a_wins": 0, "b_wins": 0, "ties": 0}

    def judge_comparison(self, prompt, response_a, response_b):
        # Placeholder: return "a", "b", or "tie" from a blind LLM or human judge
        raise NotImplementedError

    def compare(self, prompt):
        response_a = self.model_a.generate(prompt)
        response_b = self.model_b.generate(prompt)

        # Blind comparison with LLM judge
        winner = self.judge_comparison(prompt, response_a, response_b)

        # Map the verdict to a results key ("tie" is stored under "ties")
        key = "ties" if winner == "tie" else f"{winner}_wins"
        self.results[key] += 1

        return {
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner
        }

    def get_statistics(self):
        total = sum(self.results.values())
        if total == 0:
            raise ValueError("no comparisons recorded yet")
        return {
            "model_a_win_rate": self.results["a_wins"] / total,
            "model_b_win_rate": self.results["b_wins"] / total,
            "tie_rate": self.results["ties"] / total
        }
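Win rates alone do not say whether a difference is larger than chance. One simple check on the win counts is a two-sided sign test, which asks how likely a split at least this lopsided would be if both models were equally good. The sketch below is one possible approach, not part of the framework above; it ignores ties and assumes the comparisons are independent.

```python
from math import comb

def sign_test_p_value(a_wins, b_wins):
    """Two-sided exact binomial test of H0: each model wins with p = 0.5."""
    n = a_wins + b_wins  # ties carry no information about direction
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # Probability of a split at least as extreme in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p_value(5, 5))   # even split: no evidence of a difference
print(sign_test_p_value(10, 0))  # clean sweep over 10 comparisons
```

A small p-value suggests the preference is unlikely to be noise. With only a handful of prompts, even a sizeable win-rate gap may not reach that bar.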

Leaderboard Comparison

Open LLM Leaderboard

Model             | MMLU  | HellaSwag | TruthfulQA | Average
------------------|-------|-----------|------------|--------
GPT-4             | 86.4% | 95.3%     | 59.0%      | 80.2%
Claude 3 Opus     | 86.8% | 95.4%     | 60.2%      | 80.8%
Gemini Pro        | 79.1% | 87.8%     | 47.0%      | 71.3%
Llama 3 70B       | 79.5% | 88.0%     | 45.0%      | 70.8%
Mistral Large     | 81.2% | 89.2%     | 50.0%      | 73.5%

Enterprise Evaluation

Custom Benchmark

import time
import numpy as np

class EnterpriseEvaluation:
    def __init__(self, model, test_cases):
        self.model = model
        self.test_cases = test_cases

    def evaluate(self):
        # check_accuracy and check_safety are task-specific hooks to implement
        results = {
            "accuracy": [],
            "latency": [],
            "cost": [],  # not populated here; fill from token usage if available
            "safety": []
        }

        for case in self.test_cases:
            # perf_counter is a monotonic clock suited to timing short intervals
            start = time.perf_counter()
            response = self.model.generate(case.prompt)
            latency = time.perf_counter() - start

            results["latency"].append(latency)
            results["accuracy"].append(
                self.check_accuracy(response, case.expected)
            )
            results["safety"].append(
                self.check_safety(response)
            )

        return {
            "avg_accuracy": np.mean(results["accuracy"]),
            "p95_latency": np.percentile(results["latency"], 95),
            "safety_rate": np.mean(results["safety"])
        }
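The results dictionary above reserves a `cost` entry, but the loop never fills it. If the model API reports token usage, a per-request cost can be estimated as in the sketch below. The function name and the per-1,000-token pricing parameters are hypothetical, and the prices in the example are placeholders rather than real rates.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from token counts and unit prices."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Placeholder prices for illustration only.
cost = estimate_cost(prompt_tokens=2000, completion_tokens=500,
                     input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"{cost:.3f}")
```

Appending this value to `results["cost"]` inside the loop would allow an average or total cost to be reported alongside accuracy and latency.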

Conclusion

Model evaluation is a critical step for the success of AI projects. You can make informed model selections and ensure continuous improvement with the right metrics and benchmarks.

At Veni AI, we offer enterprise model evaluation services.

