AgentPantheon

Relari (YC W24)

Testing, evaluation, and synthetic data generation platform for AI agents.

4.3 (6)
Reviewed by Daniel Nikulshyn · Updated May 2026

Overview

Relari is a developer platform focused on improving the reliability of AI agents through systematic testing and evaluation. It helps teams generate synthetic datasets, run automated evaluations, and benchmark agent performance across realistic scenarios before shipping to production. Backed by Y Combinator (W24), Relari targets engineering teams building complex LLM applications and multi-step agents where traditional QA falls short. Its tooling aims to bring software-engineering rigor—unit tests, regression checks, and measurable metrics—to non-deterministic AI systems. The platform supports custom evaluators, scenario simulation, and continuous monitoring, making it useful for both pre-launch validation and ongoing quality assurance of production agents.

Key Features

  • Synthetic dataset generation
  • Automated agent evaluation pipelines
  • Scenario and conversation simulation
  • Customizable evaluation metrics
  • Regression testing for LLM apps
  • Performance benchmarking and reporting

Pros & Cons

Pros

  • Purpose-built for evaluating multi-step AI agents
  • Generates synthetic test data at scale
  • Supports custom metrics and evaluators
  • Backed by Y Combinator with active development

Cons

  • Primarily aimed at technical teams, not non-developers
  • Newer platform with an evolving feature set
  • May require integration work to fit existing stacks

Reviews

4.3

Average of 6 ratings.

  • 5 stars: 2
  • 4 stars: 4
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

Fatima Zahra

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on customizable evaluation metrics, and how well it is purpose-built for evaluating multi-step AI agents caught me off guard. I'd recommend giving it a real trial.

Robert Ainsworth

Solid for our team

We rolled this out across the team last quarter, and it supports custom metrics and evaluators. The customizable evaluation metrics fit neatly into how we already work and removed a step we used to do by hand. It is primarily aimed at technical teams, not non-developers, which is the main caveat, but it has held up under daily use.

Devin Walker

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on performance benchmarking and reporting, and its support for custom metrics and evaluators caught me off guard. It is primarily aimed at technical teams, not non-developers, which is why this isn't a perfect score. Still, I'd recommend giving it a real trial.

Carlos Mendoza

Compared a few options

Evaluated this against two competitors. Where it wins: scenario and conversation simulation, and a design purpose-built for evaluating multi-step AI agents. Where it lags: it may require integration work to fit existing stacks. On balance the feature set, especially scenario and conversation simulation, justifies the 5 stars for our use case.

Yuki Mori

Use it every day

Honestly didn't expect to like it this much. Performance benchmarking and reporting is exactly what I needed, and it is purpose-built for evaluating multi-step AI agents. I do wish it required less integration work to fit existing stacks, but I reach for it almost every day now and it just clicks.

Leila Hassan

Skeptical, then convinced

I went in skeptical, since most tools in this space overpromise. It actually delivers on regression testing for LLM apps, and its support for custom metrics and evaluators caught me off guard. It may require integration work to fit existing stacks, which is why this isn't a perfect score. Still, I'd recommend giving it a real trial.

Observability alternatives