Evaluation API
Offline testing for models and agents. Define a target (what to test), a dataset (the test cases), and a suite (binds a target + dataset + one or more scorers), then run the suite to produce a scored result.
All endpoints are under /api/evaluation/* and are session-authenticated (dashboard surface). Requests are tenant- and project-scoped from the session.
Concepts
target → a model | agent | external endpoint under test
dataset → an ordered list of items { input messages, expected? }
suite → target + dataset + scorers[] (+ judge model)
run → one execution of a suite over its dataset, with aggregate + per-item scoresScorers
| Type | What it checks |
|---|---|
assertion | Deterministic checks against expected: mustContain, equals, regex, jsonSchema, jsonPath. |
llm-judge | An LLM grades the output against a rubric (0–1), backed by judgeModelKey. |
A run's per-item score is the weighted mean of its scorer scores; passed is true when every scorer passes. The aggregate reports passRate, avgScore, and avgLatencyMs.
Targets
GET /api/evaluation/targets?kind=model&search=gpt
POST /api/evaluation/targets
GET /api/evaluation/targets/:id
PATCH /api/evaluation/targets/:id
DELETE /api/evaluation/targets/:idCreate
{
"name": "GPT-4o production",
"kind": "model",
"modelKey": "gpt-4o"
}| Field | Type | Required | Notes |
|---|---|---|---|
name | string | yes | Display name; key is slugified from it. |
kind | model | agent | external | yes | model is live; agent/external are recorded as per-item errors until their adapters ship. |
modelKey | string | for model | Registered model key. |
agentKey | string | for agent | Registered agent key. |
Datasets
GET /api/evaluation/datasets?search=faq
POST /api/evaluation/datasets
GET /api/evaluation/datasets/:id
PATCH /api/evaluation/datasets/:id
DELETE /api/evaluation/datasets/:idCreate
{
"name": "FAQ regression",
"items": [
{
"id": "q1",
"input": [{ "role": "user", "content": "What is 2+2?" }],
"expected": { "mustContain": ["4"] }
}
]
}items[].input is an array of { role, content } chat messages. expected is optional and consumed by the assertion scorer.
Suites
GET /api/evaluation/suites?search=faq
POST /api/evaluation/suites
GET /api/evaluation/suites/:id
PATCH /api/evaluation/suites/:id
DELETE /api/evaluation/suites/:idCreate
{
"name": "FAQ accuracy",
"targetKey": "gpt-4o-production",
"datasetKey": "faq-regression",
"scorers": [
{ "type": "assertion" },
{ "type": "llm-judge", "rubric": "Answer is correct and concise." }
],
"judgeModelKey": "gpt-4o"
}| Field | Type | Required | Notes |
|---|---|---|---|
targetKey / datasetKey | string | yes | Keys of an existing target and dataset. |
scorers | array | yes | Non-empty; each { type, weight?, rubric?, threshold? }. |
judgeModelKey | string | when an llm-judge scorer is present | Model used for grading. |
runConfig.concurrency | number | no | Parallel items per run. |
Runs
Run a suite
POST /api/evaluation/suites/:key/runThis dashboard-surface run is asynchronous: it loads the suite, target, and dataset, enqueues the run, and returns immediately with HTTP 202 Accepted and a pending run. The background worker then executes the target over every item with bounded concurrency, scores each output, and persists the result. Poll GET /api/evaluation/runs/:id to watch it finish.
The token-authenticated client surface (
POST /api/client/v1/evaluation/suites/{key}/run, below) stays synchronous and returns201with the completed run.
Response — 202 Accepted
{
"run": {
"id": "…",
"suiteKey": "faq-accuracy",
"status": "pending",
"aggregate": null,
"items": []
}
}Once completed, the run carries its aggregate and per-item scores:
{
"run": {
"id": "…",
"suiteKey": "faq-accuracy",
"status": "completed",
"aggregate": { "total": 12, "completed": 12, "failed": 0, "passed": 11, "passRate": 0.916, "avgScore": 0.94, "avgLatencyMs": 410 },
"items": [
{ "itemId": "q1", "passed": true, "score": 1, "scores": [ { "scorerType": "assertion", "score": 1, "passed": true, "weight": 1 } ], "output": { "text": "4", "latencyMs": 380 } }
]
}
}A target/judge error on an item is recorded on that item (error) and counted in aggregate.failed; it does not abort the run.
Generate a dataset
POST /api/evaluation/datasets/generateSynthesizes dataset items (e.g. from a prompt or seed) and returns the generated dataset.
List / get runs
GET /api/evaluation/runs?suiteKey=faq-accuracy&limit=50
GET /api/evaluation/runs/:idCompare runs
GET /api/evaluation/runs/:id/compare?baseline=<runId>Returns a per-item and aggregate diff between run :id and the baseline run, for tracking regressions across runs of the same suite.
Client API (token-authenticated)
For CI / automation there is a second, API-token-authenticated surface under /api/client/v1/evaluation/*. Authenticate with Authorization: Bearer <token> (create one in Settings → API Tokens); calls are scoped to the token's project and gated by the evaluations permission. Fields are snake_case, like the other client modules.
This surface is read- and trigger-oriented: you discover suites, run one, and read results. Authoring targets, datasets and suites stays on the session-authenticated dashboard surface above.
List suites
GET /api/client/v1/evaluation/suites{
"suites": [
{
"key": "faq-accuracy",
"name": "FAQ Accuracy",
"target_key": "gpt-target",
"dataset_key": "faq-set",
"judge_model_key": "judge-1",
"scorers": [{ "type": "assertion" }, { "type": "llm-judge", "rubric": "…" }],
"created_at": "…"
}
]
}Run a suite
Runs the suite synchronously over its dataset and returns the scored result. The suite, its target and its dataset must already exist.
POST /api/client/v1/evaluation/suites/{key}/run{
"run": {
"id": "run_abc123",
"suite_key": "faq-accuracy",
"target_key": "gpt-target",
"dataset_key": "faq-set",
"status": "completed",
"aggregate": {
"total": 24, "completed": 24, "failed": 0, "passed": 22,
"pass_rate": 0.9167, "avg_score": 0.94, "avg_latency_ms": 812
},
"items": [
{
"item_id": "q1", "passed": true, "score": 1, "latency_ms": 640,
"output_text": "…",
"scores": [{ "scorer_type": "assertion", "score": 1, "passed": true, "weight": 1 }]
}
],
"started_at": "…", "finished_at": "…"
}
}List / get runs
/runs returns summaries (aggregate only); fetch a single run by id for the full per-item breakdown.
GET /api/client/v1/evaluation/runs?suite_key=faq-accuracy&limit=20
GET /api/client/v1/evaluation/runs/{id}CI example
const BASE = 'https://your-cognipeer-host';
const headers = { Authorization: `Bearer ${process.env.COGNIPEER_API_TOKEN}` };
const { run } = await fetch(
`${BASE}/api/client/v1/evaluation/suites/faq-accuracy/run`,
{ method: 'POST', headers },
).then((r) => r.json());
if ((run.aggregate?.pass_rate ?? 0) < 0.9) {
console.error(`Evaluation gate failed: ${run.aggregate.passed}/${run.aggregate.total}`);
process.exit(1);
}Alerting
Run aggregates feed the alert system through the evaluation module:
| Metric | Source |
|---|---|
evaluation_pass_rate | aggregate.passRate × 100, averaged over completed runs in the window. |
evaluation_avg_score | aggregate.avgScore × 100, same. |
Create an alert rule (module evaluation) such as evaluation_pass_rate lt 80 to be notified when quality regresses. See the Evaluation & Analysis guide.
Errors
| Status | Cause |
|---|---|
| 400 | Missing name, bad kind, empty scorers, non-array items. |
| 404 | Target / dataset / suite / run not found (or unknown suiteKey on run). |
| 500 | Internal error. |