
Public benchmarks and leaderboards are useful, but they rarely reflect your real production workload.
A model that performs well on public datasets might fail when exposed to your prompts, your documents, and your users’ behavior.
That’s why serious AI teams build a small evaluation harness to test models on their own data before deploying anything.
This guide explains how to create a simple, practical LLM evaluation setup that lets you compare models like GPT, Claude, Gemini, and open-source alternatives using your real tasks.
Benchmark platforms measure models using standard datasets. Those results are useful for broad comparison, but production workloads often look very different.
Real production tasks like these are rarely captured by public benchmarks.
A practical evaluation harness focuses on a few high-signal metrics rather than dozens of academic benchmarks.
The most important metric is simply:
Did the model produce the correct result?
Examples:
Accuracy is often measured using:
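Two commonly used measures are exact match (after normalization) and a fuzzy similarity score for partial credit. A minimal sketch using only the standard library (function names are illustrative):

```python
import difflib

def exact_match(output: str, expected: str) -> bool:
    """Strict comparison after trimming whitespace and lowercasing."""
    return output.strip().lower() == expected.strip().lower()

def fuzzy_score(output: str, expected: str) -> float:
    """Similarity ratio in [0, 1], useful for partial-credit scoring."""
    return difflib.SequenceMatcher(None, output, expected).ratio()

print(exact_match("Paris", " paris "))  # True
print(fuzzy_score("The meeting covered roadmap planning",
                  "The meeting discussed roadmap planning"))
```

Exact match works well for short factual answers; fuzzy scoring is better suited to summaries and free-form text.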
Large language models sometimes confidently generate incorrect information.
You should track:
For applications like documentation assistants or research tools, hallucinations can be more dangerous than simple errors.
A model that produces great answers but takes 8 seconds per request may not be usable in real products.
Measure:
These numbers help you balance quality vs speed.
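Latency can be captured alongside the evaluation run itself. A minimal sketch that times each request and reports p50/p95 in milliseconds (the lambda stands in for a real model call):

```python
import statistics
import time

def measure_latency(call_fn, inputs):
    """Time each call and return p50/p95 latency in milliseconds."""
    timings = []
    for prompt in inputs:
        start = time.perf_counter()
        call_fn(prompt)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return {
        "p50": statistics.median(timings),
        "p95": timings[min(len(timings) - 1, int(len(timings) * 0.95))],
    }

# Stub standing in for a real model call (sleeps ~10 ms).
stats = measure_latency(lambda p: time.sleep(0.01), ["a", "b", "c"])
print(stats)
```

Tracking p95 rather than only the average matters because tail latency is what users actually feel.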
Different models can vary significantly in price.
Your evaluation harness should track:
Sometimes a slightly weaker model is the better choice if it is 10× cheaper.
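Cost per request is easy to compute from token counts. A minimal sketch with made-up per-million-token prices (the model names and prices below are illustrative placeholders, not real provider pricing):

```python
# Illustrative per-million-token prices in dollars -- NOT real pricing.
PRICES = {
    "model-large": {"input": 10.00, "output": 30.00},
    "model-small": {"input": 0.50, "output": 1.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in dollars, given per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with 2,000 input tokens and 500 output tokens:
for model in PRICES:
    print(model, round(request_cost(model, 2000, 500), 5))
```

Multiplying the per-request cost by expected traffic makes the 10× price gaps between models concrete.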
Many production systems require structured outputs like JSON.
Your evaluation harness should track:
A model that frequently breaks JSON can cause downstream failures.
While you can build a simple evaluation harness yourself, several tools help automate the process. These are among the leading options available as of 2026:
Promptfoo
Promptfoo - An open-source framework for evaluating prompts and models with automated tests and scoring.
LangSmith
LangSmith - Evaluation and observability platform for LLM applications, often used with LangChain.
DeepEval
DeepEval - A developer-focused Pytest-like framework for testing LLM applications with 14+ research-backed metrics.
Arize Phoenix
Arize Phoenix - Open-source observability that includes specialized evaluators for hallucinations, QA, and RAG relevance.
Ragas
Ragas - The industry standard for evaluating RAG pipelines, measuring context precision and answer faithfulness.
Galileo
Galileo - An enterprise-grade platform with built-in guardrails, real-time safety monitoring, and low-cost Luna-2 evaluation models.
Langfuse
Langfuse - Open-source observability and evaluation platform that bridges developer traces with production metrics.
Deepchecks
Deepchecks - Focused on continuous evaluation to detect drift and regressions in production Generative AI systems.
These tools provide features such as:
You do not need a complex infrastructure to start evaluating models.
A minimal evaluation harness can be built with just three components:
Start by collecting 20–100 real examples from your application.
Each example should include:
Example dataset format:
```json
[
  {
    "input": "Summarize this meeting transcript...",
    "expected": "The meeting discussed roadmap planning..."
  },
  {
    "input": "Convert this requirement into user stories",
    "expected": "As a user, I want..."
  }
]
```

This dataset becomes your ground truth for comparing models.
Next, run each prompt through multiple models.
For example:
The goal is to generate outputs for each model so you can compare them.
Example pseudo-code:
```python
models = ["gpt-4", "claude-3", "gemini-pro"]

for example in dataset:
    for model in models:
        response = call_model(model, example["input"])
        save_output(model, example, response)
```

Now you have a matrix of outputs for comparison.
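The pseudo-code above can be made runnable by stubbing the model call; `call_model` here is a placeholder for your real provider SDK calls:

```python
models = ["gpt-4", "claude-3", "gemini-pro"]
dataset = [{"input": "Summarize this meeting transcript...",
            "expected": "The meeting discussed roadmap planning..."}]

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in the real provider SDK call here.
    return f"[{model}] response to: {prompt[:30]}"

# (model, example index) -> response
outputs = {}
for i, example in enumerate(dataset):
    for model in models:
        outputs[(model, i)] = call_model(model, example["input"])

for (model, i), response in outputs.items():
    print(model, i, response)
```

Persisting this matrix (for example as JSON) lets you score the same outputs repeatedly without re-calling the APIs.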
There are several ways to score outputs:
The most reliable method is manual scoring.
Example:
| Model | Accuracy | Clarity | Hallucination |
|---|---|---|---|
| GPT-4 | 9 | 9 | Low |
| Claude | 8 | 9 | Low |
| Gemini | 7 | 7 | Medium |
Even simple scoring provides valuable insights.
For structured outputs, automated checks work well.
Example validations:
These checks can run automatically in your evaluation pipeline.
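A minimal sketch of such a check, assuming the application expects a JSON object with certain keys (the `required_keys` here are illustrative; substitute your own schema):

```python
import json

def check_json_output(raw: str, required_keys=("title", "summary")) -> dict:
    """Validate that a model response parses as JSON and has required keys."""
    result = {"valid_json": False, "missing_keys": list(required_keys)}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True
    if isinstance(data, dict):
        result["missing_keys"] = [k for k in required_keys if k not in data]
    return result

print(check_json_output('{"title": "Q3 plan", "summary": "..."}'))
print(check_json_output("Sure! Here is the JSON you asked for:"))
```

Aggregating these results across the dataset gives you a JSON validity rate per model, which feeds directly into the format-compliance metric above.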
Another method is using an LLM to evaluate outputs.
Example prompt:
```
You are evaluating AI responses.

Score the response from 1–10 based on:
- correctness
- completeness
- hallucinations

Input: {prompt}
Response: {model_output}
```

While not perfect, this approach can scale evaluation across hundreds of examples.
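The judge prompt can be wrapped in a small scoring helper. This sketch stubs the judge call so it runs without an API key; `call_judge` is a placeholder for a real LLM API call:

```python
import re

JUDGE_TEMPLATE = """You are evaluating AI responses.

Score the response from 1-10 based on:
- correctness
- completeness
- hallucinations

Input: {prompt}
Response: {model_output}

Reply with a single integer score."""

def parse_score(judge_reply: str):
    """Pull the first integer in 1-10 out of the judge's reply."""
    match = re.search(r"\b(10|[1-9])\b", judge_reply)
    return int(match.group(1)) if match else None

def judge(prompt: str, model_output: str, call_judge):
    """`call_judge` is a placeholder for a real LLM API call."""
    reply = call_judge(JUDGE_TEMPLATE.format(prompt=prompt,
                                             model_output=model_output))
    return parse_score(reply)

# Stubbed judge so the sketch runs without an API key:
score = judge("Summarize...", "The meeting...", lambda p: "Score: 8")
print(score)  # 8
```

Parsing defensively matters in practice, since judge models occasionally reply with prose instead of a bare number.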
A typical evaluation workflow might look like this:
This process often reveals surprising differences between models.
These tradeoffs help you choose the right model for your application.
Once you have a working evaluation harness, you can integrate it into continuous integration pipelines.
Typical workflow:
This approach prevents silent quality regressions when prompts or models change.
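One way to wire this into CI is a pytest-style gate that fails the build when the average evaluation score drops below a threshold. The threshold and the scoring helper below are illustrative placeholders:

```python
# test_eval_regression.py -- run in CI with `pytest`.

MIN_AVERAGE_SCORE = 7.0  # Illustrative threshold; tune for your workload.

def run_eval_suite():
    """Placeholder: call your harness and return per-example scores."""
    return [8, 9, 7, 8]

def test_no_quality_regression():
    scores = run_eval_suite()
    average = sum(scores) / len(scores)
    assert average >= MIN_AVERAGE_SCORE, (
        f"Eval score dropped to {average:.1f} (minimum {MIN_AVERAGE_SCORE})"
    )
```

Failing the pipeline on a score drop turns prompt and model changes into reviewable, gated events rather than silent regressions.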
Many teams run into similar problems when evaluating AI models.
Synthetic prompts rarely reflect real user behavior. Always include real production examples.
Testing only 5–10 prompts is unreliable. Aim for at least 30–50 examples.
Model quality is important, but production systems must also consider:
Early evaluation should include human review. Fully automated scoring often misses subtle errors.
Model performance changes frequently as providers release updates.
You should re-run evaluations when:
Regular evaluations ensure your system continues delivering reliable outputs.
Building a simple evaluation harness is one of the most valuable steps in production AI engineering.
Instead of relying on public benchmarks alone, you test models against your actual workload, which reveals the tradeoffs that matter most for your product.
A minimal setup can be created with:
Once this foundation is in place, you can iterate quickly and make data-driven decisions about AI models.
Start with 30–100 real examples. Quality matters more than size in early evaluations.
Use both. Automated scoring scales better, but human review provides deeper insight.
Yes. Many teams create evaluation harnesses before building full AI features, allowing them to choose models early.
At minimum, run them whenever you:
Regular evaluation helps maintain consistent model quality in production.