Scientific, not anecdotal
Every claim about a configuration is backed by a reproducible grid-search and a numeric score. You stop arguing about which prompt feels better and start comparing 0.87 against 0.82.
Yoke Agent turns the guesswork of "is this prompt, chunking, retriever, or agent any good?" into a reproducible grid-search workflow — with RAGAS scores, G-Eval rubrics and an improvement report at the end.
OpenAI, Anthropic, Gemini, Cohere, Azure, Ollama, HF, Claude Code, custom.
HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic.
G-Eval rubrics + deterministic TOOL_CALL accuracy parsing.
Shigeo Shingo coined poka-yoke (ポカヨケ) at Toyota in the 1960s: a physical jig on the assembly line that stops a defective part from leaving. Yoke Agent is that jig, applied to AI — it makes it impossible to ship a worse configuration than the one you measured.
Don’t ship what you hope works. Ship what you measured.
RAG tuning and agent evaluation normally live in separate tools and separate notebooks. Yoke Agent unifies them — shared providers, shared cost tracking, shared report format.
LLMs propose — datasets, grids, reports — and humans approve. The philosophy is “LLM as a fast junior analyst”, not “LLM as an unchecked oracle”.
Each workbench is a guided pipeline: ingest → generate tests → grid-search → score → improvement report. No blank canvas, no notebook sprawl.
PDF, DOCX, TXT, Markdown via LangChain loaders. Five vector stores: Chroma, Pinecone, Qdrant, pgvector, Astra DB.
LLM-written Q&A pairs from your docs. Edit, add, remove, reorder. Or import a curated JSON set.
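If you import a curated set rather than generate one, a tiny validator catches malformed items before a sweep burns tokens on them. The field names below are an assumption about the import schema, not Yoke Agent's documented format:

```python
import json

# Hypothetical import format for a curated Q&A set; the actual
# schema may differ — treat the field names as an assumption.
raw = """
[
  {"question": "What retention period does the policy specify?",
   "ground_truth": "Documents are retained for seven years."}
]
"""

def load_qa_set(text: str) -> list[dict]:
    """Parse and validate a curated Q&A set before import."""
    items = json.loads(text)
    for item in items:
        if not item.get("question") or not item.get("ground_truth"):
            raise ValueError(f"incomplete item: {item}")
    return items

qa = load_qa_set(raw)
print(len(qa))  # 1
```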
Chunking, embeddings, retrievers, rerankers. Advanced strategies wired in as first-class axes — not plugins.
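A grid over those axes is just a cross-product of choices. A minimal sketch — the axis names and values here are illustrative, not Yoke Agent's actual configuration keys:

```python
import itertools

# Hypothetical grid axes; real axis names in Yoke Agent may differ.
grid = {
    "chunk_size": [256, 512, 1024],
    "embedding_model": ["text-embedding-3-small", "text-embedding-3-large"],
    "retriever": ["dense", "hybrid", "multi-query"],
    "reranker": [None, "cohere-rerank"],
}

# Every combination becomes one configuration to evaluate.
configs = [
    dict(zip(grid, values))
    for values in itertools.product(*grid.values())
]
print(len(configs))  # 3 * 2 * 3 * 2 = 36 configurations
```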
Faithfulness, answer relevancy, context precision/recall. Optional correctness, similarity, entity recall, noise sensitivity. Plus your own metrics via a hook.
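The custom-metrics hook is a plug-in point; the function below is an illustrative assumption about the kind of metric you might plug in — a crude lexical groundedness proxy — not the documented interface:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_overlap(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that appear in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = _tokens(" ".join(contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

score = context_overlap(
    "Retention is seven years.",
    ["The policy states retention is seven years for all records."],
)
print(score)  # 1.0
```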
Plain-language recommendation on which configuration to ship and why.
System prompt, tools (native + MCP servers), guardrails. fastmcp on both sides — consume or expose.
Synthetic users with distinct goals and styles. Four scenario types, each with its own rubric.
Every grid combination runs against every persona. Full transcripts and tool invocations captured.
14 rubric metrics from tool-call accuracy to refusal accuracy and hallucination. Tool use is parsed from TOOL_CALL invocations, not guessed from prose.
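Deterministic tool-use scoring can be as simple as a regex over the transcript. The `TOOL_CALL <name> <json-args>` line format below is an illustrative assumption, not Yoke Agent's actual wire format:

```python
import json
import re

# Assumed transcript convention: TOOL_CALL <name> <json-args> on its own line.
TOOL_CALL_RE = re.compile(r"^TOOL_CALL\s+(\w+)\s+(\{.*\})\s*$", re.MULTILINE)

def parse_tool_calls(transcript: str) -> list[tuple[str, dict]]:
    """Extract (tool_name, arguments) pairs without asking an LLM judge."""
    return [
        (name, json.loads(args))
        for name, args in TOOL_CALL_RE.findall(transcript)
    ]

transcript = (
    "Let me check the weather.\n"
    'TOOL_CALL get_weather {"city": "Oslo"}\n'
    "It is 4°C in Oslo."
)
print(parse_tool_calls(transcript))  # [('get_weather', {'city': 'Oslo'})]
```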
All-metrics ranking, collapsible improvement report, per-transcript drill-down with tool traces.
Cost tracking, observability and multi-tenancy aren’t bolted on — they’re how runs are instrumented by default.
OpenAI, Anthropic, Google Gemini, Cohere, Azure, Ollama, HuggingFace, Claude Code CLI, and any OpenAI-compatible endpoint (LM Studio, vLLM). Keys live in the backend DB, never the browser.
Opt-in ~50% discount on OpenAI o3/o4-mini/gpt-4.1 and Gemini 2.5.x/3.x. Flex detection is logged per-call — no manual bookkeeping.
Every LLM and embedding call is logged with estimated USD cost. Sweeps that look cheap up front stop being surprises at the end of the month.
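Per-call cost estimation is just tokens times price. A minimal sketch with made-up per-million-token prices — check your provider's current rate card, and note Yoke Agent's internal price table may look nothing like this:

```python
# Illustrative per-1M-token prices in USD; NOT current provider pricing.
PRICES = {
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single LLM call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 2,000 prompt tokens and 500 completion tokens:
print(estimate_cost("gpt-4.1-mini", 2_000, 500))  # 0.0016
```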
Production-grade observability out of the box. Pipe runs into Grafana, Honeycomb or Datadog without writing glue.
Workspace isolation with first-class authentication. One installation, multiple teams, clean boundaries.
Fine-grained breakdown of why a run failed — not just that it did. Patterns surface across configurations.
Your documents, questions and transcripts stay on your infrastructure. Only LLM calls leave, to providers you explicitly configure.
Fully documented REST API (/docs, /redoc), yoke-agent/cli for scripted workflows, MCP both consumed and exposed.
make dev or docker compose up and you have the full stack: FastAPI backend, Next.js dashboard, workers and Chroma ready to go.
Most “evaluation tools” stop at chunk-size and top-k. Yoke Agent goes a lot further — and keeps the receipts.
Every configuration claim is backed by a reproducible grid search and a numeric score. 0.87 vs 0.82, not vibes.
Shared providers, shared cost tracking, shared improvement-report format. Two disciplines that finally share a home.
Nine providers, including fully local options (Ollama, LM Studio) and CLI (Claude Code) — no API key required to start.
Token + cost logging is the default path. Flex-tier detection auto-logs the discount when providers return it.
HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic — first-class grid axes, not add-ons.
LLMs propose, humans approve. Datasets, grids and reports all go through a review step before anything locks.
Tool calls parsed deterministically from TOOL_CALL invocations. Accuracy never depends on a judge’s reading of prose.
Your docs, questions and transcripts never leave your infra. Only LLM calls go out — to endpoints you explicitly configure.
OpenTelemetry + OTLP means runs fit into the stack you already operate. Grafana, Honeycomb, Datadog — no glue needed.
MIT-licensed. Pydantic schemas, clean FastAPI routers, pluggable retrievers, custom-metrics hook. Meant to be read and forked.
These are the real constraints. Pin them up before you plan your rollout.
No managed SaaS, no hosted SSO tier, no hands-off upgrades. You run Docker Compose or make dev yourself.
RAGAS and G-Eval rely on an LLM-as-judge. Relative rankings across a grid are trustworthy; absolute scores aren’t ground truth.
4×3×4×3 = 144 runs; with 20 questions that’s 2,880 LLM calls before the judge. No built-in spend cap — scope your sweeps.
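That arithmetic is worth scripting before you launch a sweep. A back-of-envelope estimator, assuming one generation call per question per configuration and excluding judge calls:

```python
import math

def sweep_calls(axis_sizes: list[int], n_questions: int) -> tuple[int, int]:
    """Number of configurations and pre-judge LLM calls for a grid sweep."""
    configs = math.prod(axis_sizes)
    return configs, configs * n_questions

configs, calls = sweep_calls([4, 3, 4, 3], n_questions=20)
print(configs, calls)  # 144 2880
```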
Five options is great until you pick one. Chroma is the sensible default; Pinecone/Qdrant/pgvector/Astra you operate yourself.
Advanced strategies like RAPTOR and GraphRAG are powerful, but computationally expensive on large corpora. Expect long first-run indexing and real memory pressure.
LLM-driven simulated users surface many real failure modes — but they don’t replace actual users, and under-represent long-tail oddness.
Features land quickly (multi-tenancy, OTel, custom metrics, failure categorization). Pin versions and read the notes before upgrading.
| Category | Yoke Agent — RAG + Agent evaluation |
|---|---|
| License | Open source · see LICENSE |
| Deployment | Self-hosted · Docker Compose or make dev |
| Backend | Python 3.11+ · FastAPI · SQLAlchemy · Pydantic v2 |
| Frontend | Next.js 14 · React 18 · TypeScript · Tailwind |
| Storage | SQLite default · Postgres via DATABASE_URL |
| Providers | OpenAI · Anthropic · Gemini · Cohere · Azure · Ollama · HF · Claude Code CLI · custom OpenAI-compatible |
| Vector stores | ChromaDB · Pinecone · Qdrant · pgvector · Astra DB |
| RAG eval | RAGAS — 4 fixed + 4 optional metrics + custom-metrics hook |
| Agent eval | G-Eval — 14 rubric metrics + deterministic tool-call parsing |
| Observability | OpenTelemetry · OTLP export |
| Multi-tenancy | Yes · with auth |
Clone the repo, run make dev, and you'll be grid-searching your first RAG pipeline in about ten minutes.
$ git clone https://github.com/yoke-agent/yoke-agent.git
$ cd yoke-agent
$ make dev
# → dashboard at http://localhost:3000
# → API docs at http://localhost:4040/docs