
How to Evaluate LLMs on Your Own Data (Practical Eval Harness for Engineers)


Public benchmarks and leaderboards are useful, but they rarely reflect your real production workload.

A model that performs well on public datasets might fail when exposed to your prompts, your documents, and your users’ behavior.

That’s why serious AI teams build a small evaluation harness to test models on their own data before deploying anything.

This guide explains how to create a simple, practical LLM evaluation setup that lets you compare models like GPT, Claude, Gemini, and open-source alternatives using your real tasks.

Benchmark platforms measure models using standard datasets. Those results are useful for broad comparison, but production workloads often look very different.

For example:

  • A customer support chatbot must answer questions from messy user queries
  • A documentation assistant must summarize long technical documents
  • A coding assistant must generate structured, correct code

These real tasks are rarely captured by public benchmarks.

A practical evaluation harness focuses on a few high-signal metrics rather than dozens of academic benchmarks.

The most important metric is simply:

Did the model produce the correct result?

Examples:

  • Did the summary capture the key points?
  • Did the generated SQL query run correctly?
  • Did the chatbot answer the question accurately?

Accuracy is often measured using:

  • human review
  • expected answers
  • rule-based checks
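The "expected answers" and "rule-based checks" above can be sketched as a tiny scorer. This is a minimal illustration, assuming each expected output has been reduced to a few key phrases; `contains_key_points` and `accuracy` are hypothetical helper names, not a library API.

```python
# Rule-based accuracy check: an output "passes" if it contains every
# key phrase from the expected answer. Deliberately simple; real checks
# often combine this with human review.
def contains_key_points(output: str, key_points: list[str]) -> bool:
    text = output.lower()
    return all(point.lower() in text for point in key_points)

def accuracy(results: list[tuple[str, list[str]]]) -> float:
    """results: (model_output, expected_key_points) pairs."""
    passed = sum(contains_key_points(out, pts) for out, pts in results)
    return passed / len(results)
```

Substring matching is crude, but for a first harness it catches the obvious failures cheaply.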

Hallucinations are the next thing to measure: large language models sometimes confidently generate incorrect information.

You should track:

  • unsupported claims
  • fabricated citations
  • incorrect numbers or facts

For applications like documentation assistants or research tools, hallucinations can be more dangerous than simple errors.

Latency matters as well. A model that produces great answers but takes 8 seconds per request may not be usable in real products.

Measure:

  • average response time
  • p95 latency (the time within which 95% of requests complete)

These numbers help you balance quality vs speed.
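Both numbers can be computed from recorded per-request timings. A minimal sketch, assuming timings are collected in seconds and using the nearest-rank method for the percentile:

```python
import math

def latency_stats(timings: list[float]) -> tuple[float, float]:
    """Return (average, p95) latency from per-request timings in seconds."""
    ordered = sorted(timings)
    avg = sum(ordered) / len(ordered)
    # Nearest-rank p95: the value below which 95% of requests fall.
    rank = math.ceil(0.95 * len(ordered))
    return avg, ordered[rank - 1]
```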

Cost is another key metric: different models vary significantly in price.

Your evaluation harness should track:

  • tokens used
  • cost per request
  • cost per 1000 queries

Sometimes a slightly weaker model is the better choice if it is 10× cheaper.
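Per-request cost follows directly from token counts and per-token prices. A sketch, assuming prices are quoted per million tokens; the numbers in the test are placeholders, not real provider rates:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one request, with prices given per 1M tokens."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

def cost_per_1000(avg_cost_per_request: float) -> float:
    """Projected cost of 1000 queries at the observed average cost."""
    return avg_cost_per_request * 1000
```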

Finally, many production systems require structured outputs like JSON.

Your evaluation harness should track:

  • JSON validity
  • schema adherence
  • formatting consistency

A model that frequently breaks JSON can cause downstream failures.
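JSON validity and basic schema adherence are easy to check automatically. A minimal stdlib-only sketch; the "schema" here is just a dict of required keys to expected types, a simplification rather than full JSON Schema:

```python
import json

def check_structured_output(raw: str, schema: dict[str, type]) -> dict[str, bool]:
    """Check that a model output is valid JSON and matches a simple schema."""
    result = {"valid_json": False, "schema_ok": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True
    result["schema_ok"] = (
        isinstance(data, dict)
        and all(k in data and isinstance(data[k], t) for k, t in schema.items())
    )
    return result
```

For real schemas, a library like `jsonschema` is the usual next step; this version just shows the shape of the check.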

While you can build a simple evaluation harness yourself, several tools help automate the process. These are the top-tier options currently available in 2026:

Promptfoo

Promptfoo - An open-source framework for evaluating prompts and models with automated tests and scoring.

LangSmith

LangSmith - Evaluation and observability platform for LLM applications, often used with LangChain.

DeepEval

DeepEval - A developer-focused Pytest-like framework for testing LLM applications with 14+ research-backed metrics.

Arize Phoenix

Arize Phoenix - Open-source observability that includes specialized evaluators for hallucinations, QA, and RAG relevance.

Ragas

Ragas - The industry standard for evaluating RAG pipelines, measuring context precision and answer faithfulness.

Galileo

Galileo - An enterprise-grade platform with built-in guardrails, real-time safety monitoring, and low-cost Luna-2 evaluation models.

Langfuse

Langfuse - Open-source observability and evaluation platform that bridges developer traces with production metrics.

Deepchecks

Deepchecks - Focused on continuous evaluation to detect drift and regressions in production Generative AI systems.

These tools provide features such as:

  • Automated evaluation pipelines
  • Prompt testing and versioning
  • Dataset management
  • Experiment tracking and drift detection

You do not need a complex infrastructure to start evaluating models.

A minimal evaluation harness can be built with just three components:

  1. Evaluation dataset
  2. Model runner
  3. Evaluation script

Step 1 – Build Your Evaluation Dataset

Start by collecting 20–100 real examples from your application.

Each example should include:

  • input prompt
  • expected output
  • optional metadata

Example dataset format:

[
  {
    "input": "Summarize this meeting transcript...",
    "expected": "The meeting discussed roadmap planning..."
  },
  {
    "input": "Convert this requirement into user stories",
    "expected": "As a user, I want..."
  }
]

This dataset becomes your ground truth for comparing models.

Step 2 – Run Multiple Models on the Same Dataset


Next, run each prompt through multiple models.

For example:

  • GPT-4
  • Claude
  • Gemini
  • open-source models

The goal is to generate outputs for each model so you can compare them.

Example pseudo-code:

models = ["gpt-4", "claude-3", "gemini-pro"]

for example in dataset:
    for model in models:
        # one output per (model, example) pair
        response = call_model(model, example["input"])
        save_output(model, example, response)

Now you have a matrix of outputs for comparison.

Step 3 – Score the Outputs

There are several ways to score outputs. The most reliable method is manual scoring.

Example:

Model     Accuracy   Clarity   Hallucination
GPT-4     9          9         Low
Claude    8          9         Low
Gemini    7          7         Medium

Even simple scoring provides valuable insights.

For structured outputs, automated checks work well.

Example validations:

  • JSON parsing
  • schema validation
  • SQL query execution
  • regex patterns

These checks can run automatically in your evaluation pipeline.
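Two of the checks above, SQL execution and regex matching, can be sketched with the standard library. The SQL check runs a generated query against an in-memory SQLite database built from your table definitions; `sql_executes` and `matches_pattern` are illustrative names, not a framework API:

```python
import re
import sqlite3

def sql_executes(query: str, schema_sql: str) -> bool:
    """Does a generated SQL query run against the given schema?"""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)  # create the tables the query expects
        conn.execute(query)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def matches_pattern(output: str, pattern: str) -> bool:
    """Does a model output contain the required pattern?"""
    return re.search(pattern, output) is not None
```

Executing against SQLite catches syntax errors and references to missing columns, even if your production database is different.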

Another method is using an LLM to evaluate outputs.

Example prompt:

You are evaluating AI responses.
Score the response from 1–10 based on:
- correctness
- completeness
- hallucinations
Input:
{prompt}
Response:
{model_output}

While not perfect, this approach can scale evaluation across hundreds of examples.

A typical evaluation workflow might look like this:

  1. Collect 50 real prompts from your product.
  2. Run them across 3–5 models.
  3. Store outputs in a spreadsheet or database.
  4. Score results using human review or automated checks.
  5. Compare metrics (accuracy, latency, cost).

This process often reveals surprising differences between models.

For example:

  • Model A produces the best answers but is slow
  • Model B is slightly weaker but much cheaper
  • Model C fails on structured outputs

These tradeoffs help you choose the right model for your application.

Once you have a working evaluation harness, you can integrate it into continuous integration pipelines.

Typical workflow:

  1. Developer modifies prompts or model configuration
  2. CI pipeline runs evaluation dataset
  3. Results are compared with baseline scores
  4. Deployment only proceeds if metrics do not degrade

This approach prevents silent quality regressions when prompts or models change.
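The gate in step 4 can be a single comparison function. A sketch, assuming metrics are stored as name-to-value dicts and using a relative tolerance; `passes_gate` is an illustrative name, not a CI feature:

```python
def passes_gate(baseline: dict[str, float], current: dict[str, float],
                higher_is_better: dict[str, bool],
                rel_tolerance: float = 0.02) -> bool:
    """Fail the build if any metric degrades beyond the tolerance."""
    for metric, base in baseline.items():
        cur = current[metric]
        if higher_is_better[metric]:
            if cur < base * (1 - rel_tolerance):  # e.g. accuracy dropped
                return False
        elif cur > base * (1 + rel_tolerance):    # e.g. latency grew
            return False
    return True
```

The tolerance absorbs normal run-to-run noise so the pipeline only fails on real regressions.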

Many teams run into similar problems when evaluating AI models.

Synthetic prompts rarely reflect real user behavior. Always include real production examples.

Testing only 5–10 prompts is unreliable. Aim for at least 30–50 examples.

Model quality is important, but production systems must also consider:

  • response speed
  • infrastructure costs
  • rate limits

Early evaluation should include human review. Fully automated scoring often misses subtle errors.

Model performance changes frequently as providers release updates.

You should re-run evaluations when:

  • switching models
  • updating prompts
  • adding new features
  • upgrading infrastructure

Regular evaluations ensure your system continues delivering reliable outputs.

Building a simple evaluation harness is one of the most valuable steps in production AI engineering.

Instead of relying on public benchmarks alone, you test models against your actual workload, which reveals the tradeoffs that matter most for your product.

A minimal setup can be created with:

  • a small evaluation dataset
  • a script to run multiple models
  • a scoring method for outputs

Once this foundation is in place, you can iterate quickly and make data-driven decisions about AI models.

How large should my evaluation dataset be?


Start with 30–100 real examples. Quality matters more than size in early evaluations.

Should I use automated scoring or human review?


Use both. Automated scoring scales better, but human review provides deeper insight.

Can I evaluate models before building my full application?


Yes. Many teams create evaluation harnesses before building full AI features, allowing them to choose models early.

How often should I re-run evaluations?

At minimum, run them whenever you:

  • change prompts
  • switch models
  • update retrieval pipelines
  • deploy major features

Regular evaluation helps maintain consistent model quality in production.