AgentPantheon

Relari (YC W24)

Testing, evaluation, and synthetic data generation platform for AI agents.

4.3 (6)
Reviewed by Daniel Nikulshyn · Updated May 2026

Overview

Relari is a developer platform focused on improving the reliability of AI agents through systematic testing and evaluation. It helps teams generate synthetic datasets, run automated evaluations, and benchmark agent performance across realistic scenarios before shipping to production. Backed by Y Combinator (W24), Relari targets engineering teams building complex LLM applications and multi-step agents where traditional QA falls short. Its tooling aims to bring software-engineering rigor—unit tests, regression checks, and measurable metrics—to non-deterministic AI systems. The platform supports custom evaluators, scenario simulation, and continuous monitoring, making it useful for both pre-launch validation and ongoing quality assurance of production agents.
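As a rough illustration of what "regression checks and measurable metrics" can mean for a non-deterministic agent, here is a minimal, generic sketch. It does not use Relari's API; `keyword_coverage`, `regression_check`, `fake_agent`, the test cases, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from statistics import mean
from typing import Callable

def keyword_coverage(output: str, required: list[str]) -> float:
    """Hypothetical custom metric: fraction of required keywords in the output."""
    if not required:
        return 1.0
    text = output.lower()
    return sum(kw.lower() in text for kw in required) / len(required)

def regression_check(
    agent: Callable[[str], str],
    cases: list[tuple[str, list[str]]],
    runs: int = 5,
    threshold: float = 0.8,
) -> tuple[bool, dict[str, float]]:
    """Score each case over several runs and compare the overall mean to a threshold.

    Repeating each prompt averages out run-to-run variation in the agent's output.
    """
    scores = {
        prompt: mean(keyword_coverage(agent(prompt), required) for _ in range(runs))
        for prompt, required in cases
    }
    return mean(scores.values()) >= threshold, scores

def fake_agent(prompt: str) -> str:
    # Stand-in for a real agent call; always returns the same canned answer.
    return "Your refund was issued to the original payment method."

cases = [
    ("Where is my refund?", ["refund", "payment"]),
    ("Cancel my order", ["cancel"]),
]
passed, scores = regression_check(fake_agent, cases)
print(passed, scores)
```

In this sketch the second case scores zero because the canned answer never mentions cancellation, so the overall mean falls below the threshold and the check fails. In a CI pipeline that failure would block the release.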

Key features

  • Synthetic dataset generation
  • Automated agent evaluation pipelines
  • Scenario and conversation simulation
  • Customizable evaluation metrics
  • Regression testing for LLM apps
  • Performance benchmarking and reporting

Pros & cons

Pros

  • Purpose-built for evaluating multi-step AI agents
  • Generates synthetic test data at scale
  • Supports custom metrics and evaluators
  • Backed by Y Combinator with active development

Cons

  • Primarily aimed at technical teams, not non-developers
  • Newer platform with an evolving feature set
  • May require integration work to fit existing stacks

Reviews

4.3

Average of 6 reviews.

5 stars: 2
4 stars: 4
3 stars: 0
2 stars: 0
1 star: 0

Fatima Zahra

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on customizable evaluation metrics, and how purpose-built it is for evaluating multi-step AI agents caught me off guard. Still, I'd recommend giving it a real trial.

Robert Ainsworth

Solid for our team

We rolled this out across the team last quarter, and it supports custom metrics and evaluators. The customizable evaluation metrics fit neatly into how we already work and removed a step we used to do by hand. It is primarily aimed at technical teams, not non-developers, which is the main caveat, but it has held up under daily use.

Devin Walker

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on performance benchmarking and reporting, and its support for custom metrics and evaluators caught me off guard. That it is primarily aimed at technical teams, not non-developers, is why this isn't a perfect score. Still, I'd recommend giving it a real trial.

Carlos Mendoza

Compared a few options

Evaluated this against two competitors. Where it wins: scenario and conversation simulation, and being purpose-built for evaluating multi-step AI agents. Where it lags: it may require integration work to fit existing stacks. On balance the feature set, especially scenario and conversation simulation, justifies the 5 stars for our use case.

Yuki Mori

Use it every day

Honestly didn't expect to like it this much. Performance benchmarking and reporting is exactly what I needed, and it is purpose-built for evaluating multi-step AI agents. I do wish it needed less integration work to fit existing stacks, but I reach for it almost every day now and it just clicks.

Leila Hassan

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on regression testing for LLM apps, and its support for custom metrics and evaluators caught me off guard. That it may require integration work to fit existing stacks is why this isn't a perfect score. Still, I'd recommend giving it a real trial.

Alternatives to Observability