Introduction
AgentV is a CLI-first AI agent evaluation framework. It evaluates your agents locally from YAML specifications, with multi-objective scoring (correctness, latency, cost, safety). It combines deterministic code graders with customizable LLM graders, and everything is version-controlled in Git.
Why AgentV?
Best for: Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
- No cloud dependency — everything runs locally
- No server — just install and run
- Version-controlled — YAML evaluation files live in Git alongside your code
- CI/CD ready — run evaluations in your pipeline without external API calls
- Multiple grader types — code validators, LLM graders, custom Python/TypeScript
How AgentV Compares
| Feature | AgentV | LangWatch | LangSmith | LangFuse |
|---|---|---|---|---|
| Setup | npx allagents plugin install | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | Yes | No | Limited | Limited |
| CI/CD ready | Yes | Requires API calls | Requires API calls | Requires API calls |
| Version control | Yes (YAML in Git) | No | No | No |
| Graders | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
Core Concepts
Evaluation files (.yaml or .jsonl) define test cases with expected outcomes. Targets specify which agent or provider to evaluate. Graders (code or LLM) score results. Results are written as JSONL/YAML for analysis and comparison.
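The flow above (test case, target, grader, result) can be sketched in a few lines of Python. This is only an illustration of the idea, not AgentV's actual grader interface: the function names, the `id`/`input`/`expected` fields, and the `echo_upper` stand-in target are all hypothetical.

```python
import json


def grade_exact_match(response: str, expected: str) -> dict:
    """Deterministic code grader: 1.0 on a case-insensitive exact match, else 0.0."""
    matched = response.strip().lower() == expected.strip().lower()
    return {
        "score": 1.0 if matched else 0.0,
        "reasoning": "normalized exact match" if matched else "response differs from expected",
    }


def run_test(test: dict, target) -> str:
    """Send the test input to the target, grade the response, and return one JSONL line."""
    response = target(test["input"])
    result = {"id": test["id"], "response": response}
    result.update(grade_exact_match(response, test["expected"]))
    return json.dumps(result)


def echo_upper(prompt: str) -> str:
    """Hypothetical stand-in for a real agent or LLM provider."""
    return prompt.upper()


# Hypothetical test case; the field names are assumptions, not AgentV's schema.
test_case = {"id": "t1", "input": "hello", "expected": "HELLO"}
print(run_test(test_case, echo_upper))
```

Because the grader is a pure function of the response and the expected value, rerunning it on the same inputs yields the same score, which is what makes code graders reproducible.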
Key Components
- Eval files — YAML or JSONL definitions of test cases
- Tests — Individual test entries with input messages and expected outcomes
- Targets — The agent or LLM provider being evaluated
- Graders — Code graders (Python/TypeScript) or LLM graders that score responses
- Rubrics — Structured criteria with weights for grading
- Results — JSONL output with scores, reasoning, and execution traces
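One way to picture rubric weights is as a weighted average of per-criterion scores. The sketch below assumes each criterion is scored between 0.0 and 1.0 and normalizes by the total weight. The `Criterion` class and the aggregation formula are illustrative assumptions, and AgentV's real rubric scoring may differ.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """Hypothetical rubric criterion: a name, a weight, and a grader-assigned score."""
    name: str
    weight: float
    score: float  # assumed to lie in [0.0, 1.0]


def weighted_score(criteria: list[Criterion]) -> float:
    """Combine criterion scores into one number, normalized by total weight."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(c.weight * c.score for c in criteria) / total_weight


rubric = [
    Criterion("cites the correct file", weight=2.0, score=1.0),
    Criterion("explains the fix", weight=1.0, score=0.5),
    Criterion("avoids unsafe commands", weight=1.0, score=1.0),
]
print(weighted_score(rubric))  # (2.0 + 0.5 + 1.0) / 4.0 = 0.875
```

Normalizing by the total weight keeps the result in the same 0.0 to 1.0 range as the individual scores, so rubrics with different numbers of criteria stay comparable.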
AI agent navigation map
Use this topic map when you are an AI agent trying to decide which primitive or workflow to compose next:
| Goal | Start here | Why |
|---|---|---|
| Create a first eval | Quickstart → Eval files | Defines the smallest runnable YAML shape before adding advanced fields. |
| Run or resume evals | Running evals → WIP checkpoints | Covers agentv eval, concurrency, --resume, --rerun-failed, and remote partial-run recovery. |
| Choose graders | Rubrics → Code graders → LLM graders | Keeps deterministic checks, rubric scoring, and LLM judgment separate. |
| Evaluate tool use or agents | Tool trajectory → Coding agents → CLI provider | Shows how targets, transcripts, and tool-call assertions compose. |
| Share and inspect results | Results → Dashboard | Explains local artifacts, reports, remote result repositories, and Dashboard review flows. |
| Compare runs | Compare → Dashboard Analytics | Use CLI metrics for automation and Dashboard analytics for interactive inspection. |
| Govern or improve an agent workflow | Agent eval layers → Skill improvement workflow → Enterprise governance | Moves from primitive eval design to iterative agent improvement and governance checks. |
Navigation strategy recommendation
Keep the public Astro/Starlight docs as AgentV’s canonical navigation layer, and add lightweight topic-map sections like the one above when agents need a faster path through related pages. This borrows the useful LLM Wiki convention of one-line index entries with dense cross-links, without introducing a separate wiki, custom schema, or runtime navigation code.
That is the smallest fit for the current docs: Starlight already provides the sidebar, URLs, search, and link validation, while the source MDX files remain reviewable in ordinary PRs. A full LLM Wiki-style knowledge graph would add duplicate source-of-truth and maintenance overhead before AgentV has enough public docs or contradictory source material to justify provenance tracking. Revisit a richer topic-map or wiki only if a docs section grows beyond a scannable page index, or if multiple sources need explicit confidence/contradiction metadata.
Features
- Multi-objective scoring: Correctness, latency, cost, safety in one run
- Multiple grader types: Code validators, LLM graders, custom Python/TypeScript
- Built-in targets: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
- Structured evaluation: Rubric-based grading with weights and requirements
- Batch evaluation: Run hundreds of test cases in parallel
- Export: JSON, JSONL, YAML formats
- Compare results: Compute deltas between evaluation runs for A/B testing
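To make the idea of comparing runs concrete, the sketch below computes per-test score deltas between two JSONL result sets. It assumes each result line carries an `id` and a numeric `score`. Those field names and both helper functions are hypothetical, and this is not the implementation behind AgentV's own compare feature.

```python
import json


def load_scores(jsonl_text: str) -> dict[str, float]:
    """Parse JSONL text into a mapping of test id to score, skipping blank lines."""
    scores = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            scores[record["id"]] = record["score"]
    return scores


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every test id present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {test_id: candidate[test_id] - baseline[test_id] for test_id in sorted(shared)}


# Inline sample data standing in for two result files (field names are assumptions).
baseline_run = '{"id": "t1", "score": 0.5}\n{"id": "t2", "score": 1.0}\n'
candidate_run = '{"id": "t1", "score": 0.75}\n{"id": "t2", "score": 0.5}\n'

deltas = score_deltas(load_scores(baseline_run), load_scores(candidate_run))
print(deltas)  # {'t1': 0.25, 't2': -0.5}
```

A positive delta means the candidate run scored higher on that test. Tests that appear in only one run are ignored here, which is a simplification a real comparison might want to report instead.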