Monitoring & Observability

Cognipeer Console provides structured logging, health checks, and usage tracking for production observability. The operator-facing surfaces for this live under Operate → Alerts & Incidents, Configure → Audit Log, and Configure → License.

Alerts & Incidents

Alerts & Incidents is where you define threshold-based rules that fire when a metric crosses a configured boundary. Rules are grouped by signal source (Model Hub, Model Monitoring, Guardrail, Knowledge Engine, MCP Servers) so you can target a specific subsystem.
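
To make the idea of a threshold rule concrete, here is a minimal sketch of how such a rule could be evaluated. The `ThresholdRule` shape, the comparator names, and the `latency_p95_ms` metric are assumptions made for illustration; the console's actual rule schema is not documented here and may differ.

```typescript
// Hypothetical rule shape for illustration; not the console's real schema.
type Comparator = 'gt' | 'gte' | 'lt' | 'lte';

type SignalSource =
  | 'model-hub'
  | 'model-monitoring'
  | 'guardrail'
  | 'knowledge-engine'
  | 'mcp-servers';

interface ThresholdRule {
  name: string;
  source: SignalSource;
  metric: string;
  comparator: Comparator;
  threshold: number;
  enabled: boolean;
}

// Returns true when the observed value crosses the rule's boundary.
// Disabled rules never fire.
function ruleFires(rule: ThresholdRule, observed: number): boolean {
  if (!rule.enabled) return false;
  switch (rule.comparator) {
    case 'gt':
      return observed > rule.threshold;
    case 'gte':
      return observed >= rule.threshold;
    case 'lt':
      return observed < rule.threshold;
    case 'lte':
      return observed <= rule.threshold;
  }
}

// Example rule: fire when p95 latency exceeds 2000 ms (made-up values).
const latencyRule: ThresholdRule = {
  name: 'High p95 latency',
  source: 'model-hub',
  metric: 'latency_p95_ms',
  comparator: 'gt',
  threshold: 2000,
  enabled: true,
};
```

With this sketch, `ruleFires(latencyRule, 2500)` is true, while a value of exactly 2000 does not fire because the comparator is strict.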

The three counters at the top of the alerts list summarize the operational state at a glance: how many rules are active, how many are disabled, and how many fired in the last 24 hours. From here you can author a rule (New Rule), inspect open incidents that need acknowledgement (Incidents), or jump to the full firing history (View History).

Alert history

The history view shows every firing event, who acknowledged it, and how it resolved. Filters let you scope by rule, severity, source, or date range.

Each row links back to the underlying signal — for an inference alert that's a model in Model Hub, for a guardrail alert that's the policy in Guardrails — so triage stays one click away from the configuration that triggered it.

Audit Log

Every state-changing action on the console — provider edits, model deployments, prompt promotions, token issuance, license updates, member changes — is recorded in the tenant audit log.

Columns show the actor, the action, the resource type and ID, and the timestamp. Use the filters at the top to narrow by actor, action, or resource type when answering a compliance question or investigating configuration drift.

License

The License screen shows what the current installation is allowed to do — plan tier, configured project budget versus active projects, expiry, and the signed license key payload.

In an offline-enterprise deployment you paste the signed token here; the runtime verifies it on every startup against the bundled public key and enforces the limits described in Licensing. Reset to free drops back to the bundled FREE license — useful for evaluating, not for production.

Cluster topology

Multi-node deployments add a second observability surface: the node registry and the per-entity instance assignments. The Cluster page in Admin → Cluster shows every running process, the queue provider in use (in-process vs BullMQ), and which agents, MCP servers, browsers, JS runtimes, inference servers, alert rules, and automations are pinned to which node. The same data is available programmatically via GET /api/cluster/overview and GET /api/cluster/instances.
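
As a rough illustration of reading the cluster endpoints programmatically, the helper below builds the URL and performs the GET through an injected fetch-like function. Only the two paths come from this page; the helper names, the `FetchLike` type, and the pass-through `headers` parameter are assumptions. The response is returned as `unknown` because its schema is not described here, and any authentication your deployment requires would have to be supplied by the caller.

```typescript
type ClusterResource = 'overview' | 'instances';

// Minimal fetch-like signature so the helper can be exercised without a network.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the endpoint URL, tolerating trailing slashes on the base URL.
function clusterUrl(baseUrl: string, resource: ClusterResource): string {
  return `${baseUrl.replace(/\/+$/, '')}/api/cluster/${resource}`;
}

// Performs the GET and returns the parsed JSON body as-is.
async function getClusterResource(
  baseUrl: string,
  resource: ClusterResource,
  fetchImpl: FetchLike,
  headers: Record<string, string> = {},
): Promise<unknown> {
  const url = clusterUrl(baseUrl, resource);
  const res = await fetchImpl(url, { headers });
  if (!res.ok) {
    throw new Error(`GET ${url} failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

In a runtime with a global `fetch`, a call might look like `getClusterResource('https://console.example.com', 'overview', fetch)`, where the host name is a placeholder.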

Structured Logging

All server-side code uses the Winston-based structured logger:

typescript
import { createLogger } from '@/lib/core/logger';
const logger = createLogger('my-service');

Log Format

JSON (production):

json
{
  "timestamp": "2025-01-15 10:30:45.123",
  "level": "info",
  "scope": "inference",
  "requestId": "a1b2c3d4-5678-90ab-cdef",
  "tenantId": "tenant_acme",
  "message": "Chat completion",
  "model": "gpt-4",
  "tokens": 150,
  "latencyMs": 450
}

Pretty (development):

2025-01-15 10:30:45.123 info [inference](a1b2c3d4){tenant_acme} Chat completion {"model":"gpt-4","tokens":150}
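
The relationship between the two formats can be sketched as a small formatter. This is an approximation inferred from the sample lines above, not the logger's actual implementation: it assumes the pretty format shortens `requestId` to its first eight characters and appends every remaining field as compact JSON. `LogEntry` and `toPrettyLine` are hypothetical names.

```typescript
// Hypothetical entry shape mirroring the JSON sample above.
interface LogEntry {
  timestamp: string;
  level: string;
  scope: string;
  requestId?: string;
  tenantId?: string;
  message: string;
  [extra: string]: unknown;
}

// Renders an entry in a pretty-style single line.
function toPrettyLine(entry: LogEntry): string {
  const { timestamp, level, scope, requestId, tenantId, message, ...extra } = entry;
  // Assumption: the short request ID is the first 8 characters.
  const rid = requestId ? `(${requestId.slice(0, 8)})` : '';
  const tenant = tenantId ? `{${tenantId}}` : '';
  // Remaining fields are appended as compact JSON, if there are any.
  const meta = Object.keys(extra).length > 0 ? ` ${JSON.stringify(extra)}` : '';
  return `${timestamp} ${level} [${scope}]${rid}${tenant} ${message}${meta}`;
}

const sampleEntry: LogEntry = {
  timestamp: '2025-01-15 10:30:45.123',
  level: 'info',
  scope: 'inference',
  requestId: 'a1b2c3d4-5678-90ab-cdef',
  tenantId: 'tenant_acme',
  message: 'Chat completion',
  model: 'gpt-4',
  tokens: 150,
  latencyMs: 450,
};

const prettyLine = toPrettyLine(sampleEntry);
```

Unlike the sample pretty line above, this sketch also prints `latencyMs`, since it makes no attempt to guess which metadata fields the real formatter omits.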

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| LOG_LEVEL | debug (dev) / info (prod) | error, warn, info, debug |
| LOG_FORMAT | pretty (dev) / json (prod) | json or pretty |
| LOG_REQUEST_BODY | false | Log request body (sanitized) |
| LOG_RESPONSE_BODY | false | Log response body (sanitized) |
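
The defaults in the table can be read as the following resolution logic. This is a sketch, not the console's code: it assumes "prod" means `NODE_ENV === 'production'`, that boolean flags are enabled only by the literal string `true`, and that invalid values fall back to the defaults. `resolveLogConfig` and `LogConfig` are hypothetical names.

```typescript
type LogLevel = 'error' | 'warn' | 'info' | 'debug';
type LogFormat = 'json' | 'pretty';

interface LogConfig {
  level: LogLevel;
  format: LogFormat;
  logRequestBody: boolean;
  logResponseBody: boolean;
}

const LOG_LEVELS: readonly string[] = ['error', 'warn', 'info', 'debug'];

function isLogLevel(value: string | undefined): value is LogLevel {
  return value !== undefined && LOG_LEVELS.indexOf(value) !== -1;
}

// Resolves logging settings from an environment-like object.
function resolveLogConfig(env: Record<string, string | undefined>): LogConfig {
  // Assumption: production is detected via NODE_ENV.
  const isProd = env.NODE_ENV === 'production';
  const rawLevel = env.LOG_LEVEL;
  const rawFormat = env.LOG_FORMAT;

  const level: LogLevel = isLogLevel(rawLevel) ? rawLevel : isProd ? 'info' : 'debug';
  const format: LogFormat =
    rawFormat === 'json' || rawFormat === 'pretty' ? rawFormat : isProd ? 'json' : 'pretty';

  return {
    level,
    format,
    // Assumption: only the literal string "true" enables body logging.
    logRequestBody: env.LOG_REQUEST_BODY === 'true',
    logResponseBody: env.LOG_RESPONSE_BODY === 'true',
  };
}
```

In a Node.js process this would typically be called as `resolveLogConfig(process.env)`.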

Log Aggregation

JSON logs are compatible with common aggregation tools:

  • ELK Stack — Elasticsearch, Logstash, Kibana
  • Datadog — Direct JSON log ingestion
  • Grafana Loki — Label-based log aggregation
  • CloudWatch — AWS native log ingestion

Per-request fields (requestId, tenantId, scope) enable filtering and correlation across services.
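
Outside a full aggregation stack, the same correlation can be approximated by filtering newline-delimited JSON output directly. The sketch below is a generic helper, not part of the console; `filterLogs` is a hypothetical name and the sample request IDs are made up.

```typescript
type LogRecord = Record<string, unknown>;

// Returns the parsed log lines whose fields equal every key/value in `match`.
function filterLogs(ndjson: string, match: Record<string, string>): LogRecord[] {
  const hits: LogRecord[] = [];
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (typeof parsed !== 'object' || parsed === null) continue;
    const record = parsed as LogRecord;
    if (Object.keys(match).every((key) => record[key] === match[key])) {
      hits.push(record);
    }
  }
  return hits;
}

// Made-up sample output with one non-JSON line mixed in.
const sampleLogs = [
  '{"level":"info","scope":"inference","requestId":"req-1","tenantId":"tenant_acme","message":"Chat completion"}',
  '{"level":"warn","scope":"cache","requestId":"req-2","tenantId":"tenant_acme","message":"Cache miss"}',
  'not json',
  '{"level":"error","scope":"inference","requestId":"req-1","tenantId":"tenant_acme","message":"Provider timeout"}',
].join('\n');

// All entries belonging to one request, in original order.
const requestTrail = filterLogs(sampleLogs, { requestId: 'req-1' });
```

Here `requestTrail` holds the two `req-1` entries, which is the kind of per-request view the correlation fields are meant to enable.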

Health Checks

Endpoints

| Endpoint | Purpose | Response |
| --- | --- | --- |
| GET /api/health/live | Liveness (is the process alive?) | Always 200 |
| GET /api/health/ready | Readiness (are dependencies healthy?) | 200 or 503 |

Health Report

json
{
  "status": "ok",
  "uptime": 86400,
  "timestamp": "2025-01-15T10:30:00.000Z",
  "checks": {
    "mongodb": { "status": "ok", "latencyMs": 5 },
    "cache": { "status": "ok", "latencyMs": 2, "details": { "provider": "redis" } }
  }
}

Status Values

| Status | Meaning |
| --- | --- |
| ok | Component is healthy |
| degraded | Component is working but with issues |
| down | Component is unavailable |
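
One plausible way to derive the overall status and the readiness response code from the individual checks is sketched below. The page does not specify the aggregation rule, so this assumes the worst component status wins and that only `down` maps to 503 while `degraded` still returns 200. `summarizeHealth` is a hypothetical name.

```typescript
type HealthStatus = 'ok' | 'degraded' | 'down';

interface CheckResult {
  status: HealthStatus;
  latencyMs?: number;
  details?: Record<string, unknown>;
}

interface HealthSummary {
  status: HealthStatus;
  httpStatus: 200 | 503;
}

// Collapses per-dependency checks into one status and an HTTP code.
function summarizeHealth(checks: Record<string, CheckResult>): HealthSummary {
  const statuses = Object.keys(checks).map((name) => checks[name].status);
  // Assumption: the worst individual status determines the overall status.
  const status: HealthStatus =
    statuses.indexOf('down') !== -1
      ? 'down'
      : statuses.indexOf('degraded') !== -1
        ? 'degraded'
        : 'ok';
  // Assumption: only "down" makes the readiness endpoint return 503.
  return { status, httpStatus: status === 'down' ? 503 : 200 };
}

const healthy = summarizeHealth({
  mongodb: { status: 'ok', latencyMs: 5 },
  cache: { status: 'ok', latencyMs: 2, details: { provider: 'redis' } },
});
```

Whether a degraded dependency should take an instance out of rotation is a deployment decision; a stricter variant would return 503 for `degraded` as well.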

Request Tracing

Every request gets a unique requestId (UUID) that flows through the entire processing chain:

Client → Middleware (set x-request-id)
       → Route Handler
         → Service Layer
           → Provider Call
             → All log entries include requestId

Clients can provide their own request ID via the X-Request-Id header or request_id body field.
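
The ID selection described above could look roughly like the following. This is a sketch rather than the middleware's actual code: it assumes header names have already been lower-cased (as Node.js does for incoming requests) and that the header takes precedence over the body field, which the page does not state. `resolveRequestId` is a hypothetical name.

```typescript
import { randomUUID } from 'node:crypto';

// Picks a client-supplied request ID if present, otherwise generates a UUID.
function resolveRequestId(
  headers: Record<string, string | undefined>,
  body?: unknown,
): string {
  // Assumption: header keys are lower-cased by the HTTP layer.
  const fromHeader = headers['x-request-id'];
  if (fromHeader !== undefined && fromHeader.trim() !== '') {
    return fromHeader.trim();
  }
  // Assumption: the body field is only consulted when no header was sent.
  if (typeof body === 'object' && body !== null) {
    const candidate = (body as Record<string, unknown>).request_id;
    if (typeof candidate === 'string' && candidate.trim() !== '') {
      return candidate.trim();
    }
  }
  return randomUUID();
}
```

Supplying your own ID is mainly useful when the caller already has a trace or correlation ID that should appear in the console's logs.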

Circuit Breaker Monitoring

Circuit breaker states are available programmatically:

typescript
import { getAllCircuitStates } from '@/lib/core/resilience';

const states = getAllCircuitStates();
// Map<string, { state: 'closed'|'open'|'half-open', failures: number }>

Monitor for open circuits to identify unhealthy providers.
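
A small helper for that check might look like the sketch below. It works on the `Map` shape shown in the comment above, declared locally so the example is self-contained; `findOpenCircuits` and the provider names are hypothetical.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitInfo {
  state: CircuitState;
  failures: number;
}

// Returns the names of all circuits currently in the "open" state, sorted.
function findOpenCircuits(states: Map<string, CircuitInfo>): string[] {
  const open: string[] = [];
  states.forEach((info, name) => {
    if (info.state === 'open') open.push(name);
  });
  return open.sort();
}

// Made-up snapshot standing in for the result of getAllCircuitStates().
const snapshot = new Map<string, CircuitInfo>([
  ['provider-b', { state: 'open', failures: 7 }],
  ['provider-a', { state: 'closed', failures: 0 }],
  ['provider-c', { state: 'half-open', failures: 3 }],
]);

const unhealthyProviders = findOpenCircuits(snapshot);
```

In the application you would pass the result of `getAllCircuitStates()` instead of the snapshot, for example from a periodic job that logs a warning whenever the returned list is non-empty.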

Usage Tracking

Every LLM inference request is logged with:

| Field | Description |
| --- | --- |
| Model key | Which model was used |
| Provider | Which provider handled the request |
| Token counts | Input, output, total, cached |
| Latency | End-to-end request time (ms) |
| Status | Success or error |
| Tool calls | Any tool/function calls made |
| Request ID | For correlation |

Usage data is written asynchronously via fireAndForget to avoid impacting response latency.
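
The fire-and-forget pattern itself can be sketched in a few lines. This is a generic illustration of the technique, not the console's `fireAndForget` helper, whose signature is not shown on this page; `trackedFireAndForget` and `pendingCount` are hypothetical names.

```typescript
// Number of background tasks that have started but not yet settled.
let pending = 0;

// Starts a task without awaiting it. Errors are routed to onError so that a
// failed write cannot break the request that triggered it.
function trackedFireAndForget(
  task: () => Promise<unknown>,
  onError: (err: unknown) => void = () => {},
): void {
  pending += 1;
  // Promise.resolve().then(task) also turns a synchronous throw into a rejection.
  Promise.resolve()
    .then(task)
    .catch((err: unknown) => {
      try {
        onError(err);
      } catch {
        // Swallow errors from the error handler itself.
      }
    })
    .finally(() => {
      pending -= 1;
    });
}

function pendingCount(): number {
  return pending;
}
```

The caller returns its response immediately after invoking `trackedFireAndForget`. The trade-off is that a write can be lost if the process exits before the task settles, which is why a pending-task counter is worth watching.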

Runtime Pool Stats

typescript
import { runtimePool } from '@/lib/core/runtimePool';

const { size, keys } = runtimePool.stats();
// size: number of cached provider SDK instances
// keys: list of cache keys

Async Task Monitoring

typescript
import { pendingTaskCount } from '@/lib/core/asyncTask';

const count = pendingTaskCount();
// Number of fire-and-forget tasks still running

Alerts

The gateway includes an alerting system with configurable rules:

  • Alert Rules — Define conditions (thresholds, patterns) that trigger alerts
  • Alert Channels — Define notification targets (email, webhook, etc.)
  • Alert Events — Historical record of triggered alerts

| Component | Purpose |
| --- | --- |
| JSON Logs | Log aggregation + search |
| Health endpoints | Kubernetes probes + uptime monitoring |
| Usage tracking | Dashboard analytics + billing |
| Circuit breaker states | Provider health monitoring |
| Async task count | Background processing health |

Community edition is AGPL-3.0. Commercial licensing and support are available separately.