Skip to content

Evaluation & Analysis

Two related "operate" services for measuring AI quality offline:

  • Evaluation — regression-test a model or agent against a fixed dataset with deterministic and LLM-judge scorers. Answers "did this change make the model better or worse?"
  • Analysis — extract structured fields from real conversations, judge their quality, and score extraction accuracy against ground truth — on demand or on a nightly schedule. Answers "what is happening across yesterday's calls, and is quality holding?"

They are independent services that share the same architectural shape and plug into the same alerting pipeline.

Evaluation lives under Operate → Evaluations, organised into Targets, Datasets, Suites, and Runs:

Evaluations

Analysis lives under Operate → Analysis, organised into Definitions, Conversations, and Runs:

Conversation Analysis

Both screens above show the empty state of a fresh project; the primary button (New target / New definition) bootstraps the first object.

Architecture

Both services are built in the same four layers, each independently testable:

┌─ Dashboard UI ──────────────  /dashboard/evaluations, /dashboard/analysis
│   tabbed pages, create modals, run viewers
├─ REST API (Fastify plugin) ──  /api/evaluation/*, /api/analysis/*
│   validation, session + project scope
├─ Service ────────────────────  src/lib/services/{evaluation,analysis}/service.ts
│   tenant-scoped CRUD + run orchestration + live model adapters
└─ Engine core (pure, DI) ─────  runner / scorers / extraction / judge / accuracy
    no DB, queue, or model-runtime imports — the model is injected

The engine core is deliberately free of platform coupling: the model call is passed in as an invoker function. This keeps scoring logic unit-testable without a database or live model, and lets the service layer inject either the real handleChatCompletion adapter or a fake in tests.

Evaluation

Data model

EntityPurpose
TargetWhat is under test: a model, agent, or external endpoint.
DatasetOrdered test items: input messages + optional expected.
SuiteBinds a target + dataset + scorers[] (+ judgeModelKey).
RunOne execution of a suite: per-item scores + an aggregate.

Scorers

  • assertion — deterministic checks against expected: mustContain, equals, regex, jsonSchema, jsonPath.
  • llm-judge — an LLM grades the output against a rubric, normalised to 0–1, backed by the suite's judgeModelKey.

Per-item score is the weighted mean of the scorers; the item passed when all scorers pass. The run aggregate exposes passRate, avgScore, avgLatencyMs.

Walkthrough

  1. Targets → New target — pick model and a registered model key.
  2. Datasets → New dataset — paste a JSON array of items.
  3. Suites → New suite — choose the target and dataset, enable assertion and/or LLM-judge (with a rubric and judge model).
  4. Run from the suite row → the run viewer shows pass/fail, score, the per-scorer breakdown, and the model output for each item.

Model targets are live today. Agent and external targets can be created now; their execution adapters are recorded as per-item errors until they ship, so a run never aborts midway.

Analysis

Data model

EntityPurpose
DefinitionThe recipe: fieldSet, extraction prompt, modes, models, optional schedule.
ConversationAn ingested transcript ({ role, content }[]) with optional referenceFields.
RunOne execution of a definition over a set of conversations + an aggregate.

The four modes

ModeEffect
extractAlways on. Pulls the fieldSet from each transcript as typed JSON; each field is coerced to its declared type (string/number/boolean/enum) and required fields are validated.
storeWrites the extracted fields back onto the conversation (extractedFields, lastAnalyzedAt) so they can be browsed and queried later.
judgeAn LLM grades each conversation against a rubric (0–1).
accuracyCompares extracted fields to the conversation's referenceFields, per field, returning a 0–1 score and a per-field match map.

The aggregate exposes passRate (extraction success + judge pass), plus avgJudgeScore and avgExtractionAccuracy (averaged only over items that used those modes).

Walkthrough

  1. Definitions → New definition — build the field-set (key/type/required, enum values), choose modes, set the extraction model (and judge model + rubric if judging). Optionally set a cron schedule.
  2. Conversations → Ingest — paste a JSON array of transcripts. Add referenceFields to any conversation you want to score for accuracy.
  3. Run analysis from a definition row → the run viewer shows the extracted fields, judge score, and accuracy per conversation.

Automation

The IVR use case — "every night, analyze the day's calls and alert me if quality drops" — is covered by two independent, composable mechanisms.

Scheduled runs

A definition with schedule: { cron, enabled } is fired automatically by the background analysis scheduler:

  • The scheduler runs on a 60s interval, guarded by a distributed lock (use CACHE_PROVIDER=redis for multi-instance deployments) so a single instance fires each tick. It is started from the server bootstrap.
  • For each tenant it loads scheduled definitions and fires any that are due. "Due" is decided by the pure schedulePlanner: a cron slot fires at most once, compared against the definition's most recent run. So 0 2 * * * runs once per night even though the scheduler ticks every minute.
  • Scheduled runs analyze the recent conversation corpus with createdBy: "system".

Cron expressions are standard 5-field, evaluated in UTC. Set the schedule via the definition create/update API (schedule) or the dashboard.

Threshold alerts

Both services expose their run aggregates to the existing alert pipeline as metric collectors — no new alert logic is involved. Create an alert rule (in the Alerts service) on the analysis or evaluation module:

ModuleMetricMeaning (0–100)
analysisanalysis_pass_rateMean pass rate over completed runs in the window.
analysisanalysis_avg_judge_scoreMean judge score.
analysisanalysis_avg_accuracyMean extraction accuracy.
evaluationevaluation_pass_rateMean pass rate.
evaluationevaluation_avg_scoreMean weighted score.

The collectors average the persisted run aggregate over completed runs in the rule's window (excluding runs where the metric is null), honouring the project scope. The existing alert scheduler/evaluator then applies the rule's condition and fires through its channels.

Putting it together: a definition scheduled at 0 2 * * * plus an alert rule analysis_avg_accuracy lt 85 over 1440 minutes gives you a nightly analysis that pages you when extraction quality slips below 85%.

Multi-tenancy & persistence

Every entity is tenant-scoped and persisted through the dual-provider database layer (MongoDB documents or SQLite JSON columns) with full parity. Runs embed their per-item results and aggregate. Reads and writes always go through switchToTenant, so one tenant never sees another's targets, datasets, conversations, or runs.

Where things live

AreaPath
Evaluation enginesrc/lib/services/evaluation/
Analysis enginesrc/lib/services/analysis/
Schedule plannersrc/lib/services/analysis/schedulePlanner.ts
Analysis schedulersrc/lib/services/analysis/analysisScheduler.ts
Alert collectorssrc/lib/services/alerts/metrics/{analysis,evaluation}Collector.ts
DB mixinssrc/lib/database/{mongodb,sqlite}/{evaluation,analysis}.mixin.ts
API pluginssrc/server/api/plugins/{evaluations,analysis}.ts
Dashboard UIsrc/app/dashboard/{evaluations,analysis}/

See the API references for Evaluation and Analysis.

Community edition is AGPL-3.0. Commercial licensing and support are available separately.