Evaluation & Analysis

Two related "operate" services for measuring AI quality offline:

Evaluation — regression-test a model or agent against a fixed dataset with deterministic and LLM-judge scorers. Answers "did this change make the model better or worse?"
Analysis — extract structured fields from real conversations, judge their quality, and score extraction accuracy against ground truth — on demand or on a nightly schedule. Answers "what is happening across yesterday's calls, and is quality holding?"

They are independent services that share the same architectural shape and plug into the same alerting pipeline.

Evaluation lives under Operate → Evaluations, organised into Targets, Datasets, Suites, and Runs:

Evaluations

Analysis lives under Operate → Analysis, organised into Definitions, Conversations, and Runs:

Conversation Analysis

Both screens above show the empty state of a fresh project; the primary button (New target / New definition) bootstraps the first object.

Architecture

Both services are built in the same four layers, each independently testable:

┌─ Dashboard UI ──────────────  /dashboard/evaluations, /dashboard/analysis
│   tabbed pages, create modals, run viewers
├─ REST API (Fastify plugin) ──  /api/evaluation/*, /api/analysis/*
│   validation, session + project scope
├─ Service ────────────────────  src/lib/services/{evaluation,analysis}/service.ts
│   tenant-scoped CRUD + run orchestration + live model adapters
└─ Engine core (pure, DI) ─────  runner / scorers / extraction / judge / accuracy
    no DB, queue, or model-runtime imports — the model is injected

The engine core is deliberately free of platform coupling: the model call is passed in as an invoker function. This keeps scoring logic unit-testable without a database or live model, and lets the service layer inject either the real handleChatCompletion adapter or a fake in tests.

Evaluation

Data model

Entity	Purpose
Target	What is under test: a `model`, `agent`, or `external` endpoint.
Dataset	Ordered test items: `input` messages + optional `expected`.
Suite	Binds a target + dataset + `scorers[]` (+ `judgeModelKey`).
Run	One execution of a suite: per-item scores + an aggregate.

Scorers

assertion — deterministic checks against expected: mustContain, equals, regex, jsonSchema, jsonPath.
llm-judge — an LLM grades the output against a rubric, normalised to 0–1, backed by the suite's judgeModelKey.

Per-item score is the weighted mean of the scorers; the item passed when all scorers pass. The run aggregate exposes passRate, avgScore, avgLatencyMs.

Walkthrough

Targets → New target — pick model and a registered model key.
Datasets → New dataset — paste a JSON array of items.
Suites → New suite — choose the target and dataset, enable assertion and/or LLM-judge (with a rubric and judge model).
Run from the suite row → the run viewer shows pass/fail, score, the per-scorer breakdown, and the model output for each item.

Model targets are live today. Agent and external targets can be created now; their execution adapters are recorded as per-item errors until they ship, so a run never aborts midway.

Analysis

Data model

Entity	Purpose
Definition	The recipe: `fieldSet`, extraction prompt, `modes`, models, optional `schedule`.
Conversation	An ingested transcript (`{ role, content }[]`) with optional `referenceFields`.
Run	One execution of a definition over a set of conversations + an aggregate.

The four modes

Mode	Effect
extract	Always on. Pulls the `fieldSet` from each transcript as typed JSON; each field is coerced to its declared type (`string`/`number`/`boolean`/`enum`) and required fields are validated.
store	Writes the extracted fields back onto the conversation (`extractedFields`, `lastAnalyzedAt`) so they can be browsed and queried later.
judge	An LLM grades each conversation against a rubric (0–1).
accuracy	Compares extracted fields to the conversation's `referenceFields`, per field, returning a 0–1 score and a per-field match map.

The aggregate exposes passRate (extraction success + judge pass), plus avgJudgeScore and avgExtractionAccuracy (averaged only over items that used those modes).

Walkthrough

Definitions → New definition — build the field-set (key/type/required, enum values), choose modes, set the extraction model (and judge model + rubric if judging). Optionally set a cron schedule.
Conversations → Ingest — paste a JSON array of transcripts. Add referenceFields to any conversation you want to score for accuracy.
Run analysis from a definition row → the run viewer shows the extracted fields, judge score, and accuracy per conversation.

Automation

The IVR use case — "every night, analyze the day's calls and alert me if quality drops" — is covered by two independent, composable mechanisms.

Scheduled runs

A definition with schedule: { cron, enabled } is fired automatically by the background analysis scheduler:

The scheduler runs on a 60s interval, guarded by a distributed lock (use CACHE_PROVIDER=redis for multi-instance deployments) so a single instance fires each tick. It is started from the server bootstrap.
For each tenant it loads scheduled definitions and fires any that are due. "Due" is decided by the pure schedulePlanner: a cron slot fires at most once, compared against the definition's most recent run. So 0 2 * * * runs once per night even though the scheduler ticks every minute.
Scheduled runs analyze the recent conversation corpus with createdBy: "system".

Cron expressions are standard 5-field, evaluated in UTC. Set the schedule via the definition create/update API (schedule) or the dashboard.

Threshold alerts

Both services expose their run aggregates to the existing alert pipeline as metric collectors — no new alert logic is involved. Create an alert rule (in the Alerts service) on the analysis or evaluation module:

Module	Metric	Meaning (0–100)
`analysis`	`analysis_pass_rate`	Mean pass rate over completed runs in the window.
`analysis`	`analysis_avg_judge_score`	Mean judge score.
`analysis`	`analysis_avg_accuracy`	Mean extraction accuracy.
`evaluation`	`evaluation_pass_rate`	Mean pass rate.
`evaluation`	`evaluation_avg_score`	Mean weighted score.

The collectors average the persisted run aggregate over completed runs in the rule's window (excluding runs where the metric is null), honouring the project scope. The existing alert scheduler/evaluator then applies the rule's condition and fires through its channels.

Putting it together: a definition scheduled at 0 2 * * * plus an alert rule analysis_avg_accuracy lt 85 over 1440 minutes gives you a nightly analysis that pages you when extraction quality slips below 85%.

Multi-tenancy & persistence

Every entity is tenant-scoped and persisted through the dual-provider database layer (MongoDB documents or SQLite JSON columns) with full parity. Runs embed their per-item results and aggregate. Reads and writes always go through switchToTenant, so one tenant never sees another's targets, datasets, conversations, or runs.

Where things live

Area	Path
Evaluation engine	`src/lib/services/evaluation/`
Analysis engine	`src/lib/services/analysis/`
Schedule planner	`src/lib/services/analysis/schedulePlanner.ts`
Analysis scheduler	`src/lib/services/analysis/analysisScheduler.ts`
Alert collectors	`src/lib/services/alerts/metrics/{analysis,evaluation}Collector.ts`
DB mixins	`src/lib/database/{mongodb,sqlite}/{evaluation,analysis}.mixin.ts`
API plugins	`src/server/api/plugins/{evaluations,analysis}.ts`
Dashboard UI	`src/app/dashboard/{evaluations,analysis}/`

See the API references for Evaluation and Analysis.

Evaluation & Analysis ​

Architecture ​

Evaluation ​

Data model ​

Scorers ​

Walkthrough ​

Analysis ​

Data model ​

The four modes ​

Walkthrough ​

Automation ​

Scheduled runs ​

Threshold alerts ​

Multi-tenancy & persistence ​

Where things live ​

Evaluation & Analysis

Architecture

Evaluation

Data model

Scorers

Walkthrough

Analysis

Data model

The four modes

Walkthrough

Automation

Scheduled runs

Threshold alerts

Multi-tenancy & persistence

Where things live