Scientific, not anecdotal
Every claim about a configuration is backed by a reproducible grid-search and a numeric score. You stop arguing about which prompt feels better and start comparing 0.87 against 0.82.
Yoke Agent turns the guesswork of "is this prompt, chunking, retriever, or agent any good?" into a reproducible grid-search workflow — with RAGAS scores, G-Eval rubrics and an improvement report at the end.
OpenAI, Anthropic, Gemini, Cohere, Azure, Ollama, HF, Claude Code, custom.
HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic.
G-Eval rubrics + deterministic TOOL_CALL accuracy parsing.
Shigeo Shingo coined poka-yoke (ポカヨケ) at Toyota in the 1960s: a physical jig on the assembly line that stops a defective part from leaving. Yoke Agent is that jig, applied to AI — it makes it impossible to ship a worse configuration than the one you measured.
Don’t ship what you hope works. Ship what you measured.
RAG tuning and agent evaluation normally live in separate tools and separate notebooks. Yoke Agent unifies them — shared providers, shared cost tracking, shared report format.
LLMs propose — datasets, grids, reports — and humans approve. The philosophy is “LLM as a fast junior analyst”, not “LLM as an unchecked oracle”.
Each workbench is a guided pipeline: ingest → generate tests → grid-search → score → improvement report. No blank canvas, no notebook sprawl.
PDF, DOCX, TXT, Markdown via LangChain loaders. Five vector stores: Chroma, Pinecone, Qdrant, pgvector, Astra DB.
LLM-written Q&A pairs from your docs. Edit, add, remove, reorder. Or import a curated JSON set.
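If you import a curated set rather than generate one, a tiny validator catches malformed items before a sweep burns tokens on them. The field names below are an assumption about the import schema, not Yoke Agent's documented format:

```python
import json

# Hypothetical import format for a curated Q&A set; the actual
# schema may differ — treat the field names as an assumption.
raw = """
[
  {"question": "What retention period does the policy specify?",
   "ground_truth": "Documents are retained for seven years."}
]
"""

def load_qa_set(text: str) -> list[dict]:
    """Parse and validate a curated Q&A set before import."""
    items = json.loads(text)
    for item in items:
        if not item.get("question") or not item.get("ground_truth"):
            raise ValueError(f"incomplete item: {item}")
    return items

qa = load_qa_set(raw)
print(len(qa))  # 1
```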
Chunking, embeddings, retrievers, rerankers. Advanced strategies wired in as first-class axes — not plugins.
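A grid over those axes is just a cross-product of choices. A minimal sketch — the axis names and values here are illustrative, not Yoke Agent's actual configuration keys:

```python
import itertools

# Hypothetical grid axes; real axis names in Yoke Agent may differ.
grid = {
    "chunk_size": [256, 512, 1024],
    "embedding_model": ["text-embedding-3-small", "text-embedding-3-large"],
    "retriever": ["dense", "hybrid", "multi-query"],
    "reranker": [None, "cohere-rerank"],
}

# Every combination becomes one configuration to evaluate.
configs = [
    dict(zip(grid, values))
    for values in itertools.product(*grid.values())
]
print(len(configs))  # 3 * 2 * 3 * 2 = 36 configurations
```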
Faithfulness, answer relevancy, context precision/recall. Optional correctness, similarity, entity recall, noise sensitivity. Plus your own metrics via a hook.
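The custom-metrics hook is a plug-in point; the function below is an illustrative assumption about the kind of metric you might plug in — a crude lexical groundedness proxy — not the documented interface:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_overlap(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that appear in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = _tokens(" ".join(contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

score = context_overlap(
    "Retention is seven years.",
    ["The policy states retention is seven years for all records."],
)
print(score)  # 1.0
```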
Plain-language recommendation on which configuration to ship and why.
System prompt, tools (native + MCP servers), guardrails. fastmcp on both sides — consume or expose.
Synthetic users with distinct goals and styles. Four scenario types, each with its own rubric.
Every grid combination runs against every persona. Full transcripts and tool invocations captured.
14 rubric metrics from tool-call accuracy to refusal accuracy and hallucination. Tool use is parsed from TOOL_CALL invocations, not guessed from prose.
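Deterministic tool-use scoring can be as simple as a regex over the transcript. The `TOOL_CALL <name> <json-args>` line format below is an illustrative assumption, not Yoke Agent's actual wire format:

```python
import json
import re

# Assumed transcript convention: TOOL_CALL <name> <json-args> on its own line.
TOOL_CALL_RE = re.compile(r"^TOOL_CALL\s+(\w+)\s+(\{.*\})\s*$", re.MULTILINE)

def parse_tool_calls(transcript: str) -> list[tuple[str, dict]]:
    """Extract (tool_name, arguments) pairs without asking an LLM judge."""
    return [
        (name, json.loads(args))
        for name, args in TOOL_CALL_RE.findall(transcript)
    ]

transcript = (
    "Let me check the weather.\n"
    'TOOL_CALL get_weather {"city": "Oslo"}\n'
    "It is 4°C in Oslo."
)
print(parse_tool_calls(transcript))  # [('get_weather', {'city': 'Oslo'})]
```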
All-metrics ranking, collapsible improvement report, per-transcript drill-down with tool traces.
Cost tracking, observability and multi-tenancy aren’t bolted on — they’re how runs are instrumented by default.
OpenAI, Anthropic, Google Gemini, Cohere, Azure, Ollama, HuggingFace, Claude Code CLI, and any OpenAI-compatible endpoint (LM Studio, vLLM). Keys live in the backend DB, never the browser.
Opt-in ~50% discount on OpenAI o3/o4-mini/gpt-4.1 and Gemini 2.5.x/3.x. Flex detection is logged per-call — no manual bookkeeping.
Every LLM and embedding call is logged with estimated USD cost. Sweeps that look cheap up front stop being surprises at the end of the month.
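Per-call cost estimation is just tokens times price. A minimal sketch with made-up per-million-token prices — check your provider's current rate card, and note Yoke Agent's internal price table may look nothing like this:

```python
# Illustrative per-1M-token prices in USD; NOT current provider pricing.
PRICES = {
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single LLM call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 2,000 prompt tokens and 500 completion tokens:
print(estimate_cost("gpt-4.1-mini", 2_000, 500))  # 0.0016
```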
Production-grade observability out of the box. Pipe runs into Grafana, Honeycomb or Datadog without writing glue.
Workspace isolation with first-class authentication. One installation, multiple teams, clean boundaries.
Fine-grained breakdown of why a run failed — not just that it did. Patterns surface across configurations.
Your documents, questions and transcripts stay on your infrastructure. Only LLM calls leave, to providers you explicitly configure.
Fully documented REST API (/docs, /redoc), yoke-agent/cli for scripted workflows, MCP both consumed and exposed.
make dev or docker compose up and you have the full stack: FastAPI backend, Next.js dashboard, workers and Chroma ready to go.
Most “evaluation tools” stop at chunk-size and top-k. Yoke Agent goes a lot further — and keeps the receipts.
Every configuration claim is backed by a reproducible grid search and a numeric score. 0.87 vs 0.82, not vibes.
Shared providers, shared cost tracking, shared improvement-report format. Two disciplines that finally share a home.
Nine providers, including fully local options (Ollama, LM Studio) and CLI (Claude Code) — no API key required to start.
Token + cost logging is the default path. Flex-tier detection auto-logs the discount when providers return it.
HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic — first-class grid axes, not add-ons.
LLMs propose, humans approve. Datasets, grids and reports all go through a review step before anything locks.
Tool calls parsed deterministically from TOOL_CALL invocations. Accuracy never depends on a judge’s reading of prose.
Your docs, questions and transcripts never leave your infra. Only LLM calls go out — to endpoints you explicitly configure.
OpenTelemetry + OTLP means runs fit into the stack you already operate. Grafana, Honeycomb, Datadog — no glue needed.
MIT-licensed. Pydantic schemas, clean FastAPI routers, pluggable retrievers, custom-metrics hook. Meant to be read and forked.
These are the real constraints. Pin them up before you plan your rollout.
No managed SaaS, no hosted SSO tier, no hands-off upgrades. You run Docker Compose or make dev yourself.
RAGAS and G-Eval rely on an LLM-as-judge. Relative rankings across a grid are trustworthy; absolute scores aren’t ground truth.
4×3×4×3 = 144 runs; with 20 questions that’s 2,880 LLM calls before the judge. No built-in spend cap — scope your sweeps.
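That arithmetic is worth scripting before you launch a sweep. A back-of-envelope estimator, assuming one generation call per question per configuration and excluding judge calls:

```python
import math

def sweep_calls(axis_sizes: list[int], n_questions: int) -> tuple[int, int]:
    """Number of configurations and pre-judge LLM calls for a grid sweep."""
    configs = math.prod(axis_sizes)
    return configs, configs * n_questions

configs, calls = sweep_calls([4, 3, 4, 3], n_questions=20)
print(configs, calls)  # 144 2880
```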
Five options is great until you pick one. Chroma is the sensible default; Pinecone/Qdrant/pgvector/Astra you operate yourself.
Advanced strategies like RAPTOR and GraphRAG are powerful, but computationally expensive on large corpora. Expect long first-run indexing and real memory pressure.
LLM-driven simulated users surface many real failure modes — but they don’t replace actual users, and under-represent long-tail oddness.
Features land quickly (multi-tenancy, OTel, custom metrics, failure categorization). Pin versions and read the notes before upgrading.
| Category | Yoke Agent — RAG + Agent evaluation |
|---|---|
| License | Open source · see LICENSE |
| Deployment | Self-hosted · Docker Compose or make dev |
| Backend | Python 3.11+ · FastAPI · SQLAlchemy · Pydantic v2 |
| Frontend | Next.js 14 · React 18 · TypeScript · Tailwind |
| Storage | SQLite default · Postgres via DATABASE_URL |
| Providers | OpenAI · Anthropic · Gemini · Cohere · Azure · Ollama · HF · Claude Code CLI · custom OpenAI-compatible |
| Vector stores | ChromaDB · Pinecone · Qdrant · pgvector · Astra DB |
| RAG eval | RAGAS — 4 fixed + 4 optional metrics + custom-metrics hook |
| Agent eval | G-Eval — 14 rubric metrics + deterministic tool-call parsing |
| Observability | OpenTelemetry · OTLP export |
| Multi-tenancy | Yes · with auth |
Clone the repo, run make dev, and you'll be grid-searching your first RAG pipeline in about ten minutes.
$ git clone https://github.com/yoke-agent/yoke-agent.git
$ cd yoke-agent
$ make dev
# → dashboard at http://localhost:3000
# → API docs at http://localhost:4040/docs