Inference (LLM & Embeddings)
The inference service provides OpenAI-compatible chat completion and embedding endpoints backed by multiple LLM providers. Models are configured in Model Hub and called through the runtime documented below.
Where inference is observed
Operators monitor live inference through Operate → Model Monitoring. The page summarises every connected inference server, splitting them by status: active, disabled, or errored. From here you also wire up new self-hosted endpoints (vLLM, TGI, llama.cpp, Ollama) that don't fit the cloud-provider model.

For per-call inspection — request body, completion, token usage, tool calls — open the Logs tab on a model's detail page in Model Hub, or query Agent Tracing for the full trace timeline.
Chat Completions
Endpoint
POST /api/client/v1/chat/completions
Authorization: Bearer <api-token>
Request
{
"model": "gpt-4",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "What is the capital of France?" }
],
"temperature": 0.7,
"max_tokens": 1000,
"stream": false
}
Response (Non-Streaming)
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "The capital of France is Paris." },
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
},
"request_id": "req_abc123"
}
Streaming
Set "stream": true to receive Server-Sent Events:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"The"},"index":0}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" capital"},"index":0}]}
data: [DONE]
Embeddings
Endpoint
POST /api/client/v1/embeddings
Authorization: Bearer <api-token>
Request
{
"model": "text-embedding-ada-002",
"input": "Hello world"
}
Response
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0023, -0.0092, ...]
}
],
"model": "text-embedding-ada-002",
"usage": {
"prompt_tokens": 2,
"total_tokens": 2
}
}
Processing Pipeline
Request → requireApiToken()
→ Resolve model by key
→ Validate model category (LLM vs embedding)
→ Guardrail check (if configured)
→ Semantic cache lookup (if enabled)
→ Build provider runtime (via runtimePool)
→ Execute with withResilience()
→ Convert to OpenAI format
→ Log usage (fireAndForget)
→ Return response
Features
Semantic Caching
When semantic caching is enabled on a model, a query that is sufficiently similar to an earlier one returns the cached response instead of calling the provider:
- Cache lookup before provider call
- Cache store after successful response
- Configurable similarity threshold
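The lookup-then-store flow above can be illustrated with a small in-memory cache keyed by embedding vectors. This is only a sketch under stated assumptions: `SemanticCache`, `cosineSimilarity`, and the choice of cosine similarity as the metric are hypothetical and are not taken from the service's implementation, which this page does not describe.

```typescript
// Hypothetical sketch: names and the cosine-similarity metric are assumptions,
// not the service's actual implementation.
type CacheEntry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Lookup before the provider call: return the best match at or above the threshold.
  lookup(queryEmbedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }

  // Store after a successful provider response.
  store(queryEmbedding: number[], response: string): void {
    this.entries.push({ embedding: queryEmbedding, response });
  }
}
```

A higher threshold makes cache hits rarer but safer; a lower one saves more provider calls at the risk of returning an answer to a subtly different question.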
Guardrail Integration
Models can have guardrails attached that evaluate the input before it is sent to the provider. A blocked request throws:
// If guardrail blocks the request
throw new GuardrailBlockError(guardrailKey, action, findings);
Usage Logging
Every request is logged asynchronously (via fireAndForget):
- Token counts (prompt, completion, total)
- Latency (ms)
- Model and provider info
- Tool call metadata
- Request ID for correlation
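As a rough illustration of asynchronous logging, the sketch below builds a usage record from the fields listed above and hands it to a fire-and-forget helper. The real `fireAndForget` signature is not shown on this page, so the one here, along with `UsageRecord` and `buildUsageRecord`, is an assumption made for the example.

```typescript
// Hypothetical sketch: the service's real fireAndForget signature and log
// schema are not documented here; these shapes are assumptions.
type UsageRecord = {
  requestId: string;
  model: string;
  provider: string;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  latencyMs: number;
};

function buildUsageRecord(
  requestId: string,
  model: string,
  provider: string,
  usage: { prompt_tokens: number; completion_tokens: number },
  latencyMs: number
): UsageRecord {
  return {
    requestId,
    model,
    provider,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.prompt_tokens + usage.completion_tokens,
    latencyMs,
  };
}

function fireAndForget(
  task: () => Promise<void>,
  onError: (err: unknown) => void = (err) => console.error(err)
): void {
  // Schedule the task without awaiting it. Failures go to onError so they
  // neither delay the response nor become unhandled rejections.
  Promise.resolve().then(task).catch(onError);
}
```

The caller returns the response immediately after calling `fireAndForget`; the trade-off is that a failed log write is only reported through `onError` and never reaches the client.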
Provider Resilience
External provider calls are wrapped with:
- Retry — Exponential backoff for transient failures
- Circuit breaker — Automatic rejection when provider is down
- Runtime pooling — Cached SDK clients for performance
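The retry and circuit-breaker behaviour listed above might look roughly like the following. This is a minimal sketch, not the service's `withResilience()`: the function is deliberately named `withResilienceSketch`, and the thresholds, delays, and the `CircuitBreaker` class are hypothetical. For brevity it treats every error as transient, which a real implementation would likely refine.

```typescript
// Hypothetical sketch of retry plus circuit breaker; all names, defaults,
// and the "every error is transient" policy are assumptions.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  isOpen(now: number = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and allow a trial call.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

async function withResilienceSketch<T>(
  call: () => Promise<T>,
  breaker: CircuitBreaker,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Reject immediately while the breaker is open instead of calling the provider.
    if (breaker.isOpen()) throw new Error("circuit open: provider call rejected");
    try {
      const result = await call();
      breaker.recordSuccess();
      return result;
    } catch (err) {
      lastError = err;
      breaker.recordFailure();
      if (attempt < maxAttempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

Sharing one breaker per provider means repeated failures from any request open the circuit for all of them, which is what lets the service fail fast while a provider is down.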
Model Configuration
Models are configured in the dashboard with:
| Field | Description |
|---|---|
| key | Unique model identifier per tenant |
| category | llm or embedding |
| providerKey | Which provider config to use |
| modelId | Provider-specific model name |
| pricing | Cost per 1M tokens (input/output) |
| overrides | Default parameters (temperature, maxTokens, etc.) |
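To make the table concrete, the fields could be modelled as a TypeScript type, with the per-1M-token pricing used to estimate the cost of a call. The top-level field names come from the table; the nested names (`inputPer1M`, `outputPer1M`), the `estimateCost` helper, and the example values are assumptions for illustration only.

```typescript
// Top-level field names follow the table above; nested shapes and the
// estimateCost helper are hypothetical.
type ModelCategory = "llm" | "embedding";

interface ModelConfig {
  key: string;            // unique per tenant
  category: ModelCategory;
  providerKey: string;
  modelId: string;
  pricing: { inputPer1M: number; outputPer1M: number };
  overrides?: { temperature?: number; maxTokens?: number };
}

// Cost of one call, given prices expressed per 1M tokens.
function estimateCost(model: ModelConfig, promptTokens: number, completionTokens: number): number {
  return (
    (promptTokens / 1e6) * model.pricing.inputPer1M +
    (completionTokens / 1e6) * model.pricing.outputPer1M
  );
}

// Example values are made up and do not reflect real provider pricing.
const exampleModel: ModelConfig = {
  key: "gpt-4",
  category: "llm",
  providerKey: "example-provider",
  modelId: "gpt-4",
  pricing: { inputPer1M: 30, outputPer1M: 60 },
  overrides: { temperature: 0.7, maxTokens: 1000 },
};
```

Calling `estimateCost(exampleModel, promptTokens, completionTokens)` with the `usage` counts from a response gives a per-request cost in whatever currency the pricing fields use.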
Error Handling
| Status | Meaning |
|---|---|
| 400 | Missing model key or invalid request body |
| 401 | Invalid or missing API token |
| 429 | Quota, rate limit, or budget exceeded |
| 500 | Provider/internal error (includes unresolved model, category mismatch, and circuit-breaker-open) |
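On the client side, the status codes above can be mapped to a next step before deciding whether to retry. The sketch below is one possible mapping, not prescribed behaviour: `classifyStatus`, `buildChatRequest`, the action names, the `baseUrl` parameter, and the `Content-Type` header are all assumptions. Because 500 covers both retryable conditions (circuit breaker open) and non-retryable ones (unresolved model, category mismatch), the sketch asks the caller to inspect the response rather than retry blindly.

```typescript
// Hypothetical client-side helpers; action names and request shape beyond
// the documented endpoint and Authorization header are assumptions.
type ErrorAction =
  | "ok"
  | "fix_request"          // 400: missing model key or invalid body
  | "check_token"          // 401: invalid or missing API token
  | "back_off"             // 429: quota, rate limit, or budget exceeded
  | "inspect_server_error" // 500: provider/internal error, cause varies
  | "unknown";

function classifyStatus(status: number): ErrorAction {
  if (status >= 200 && status < 300) return "ok";
  switch (status) {
    case 400:
      return "fix_request";
    case 401:
      return "check_token";
    case 429:
      return "back_off";
    case 500:
      return "inspect_server_error";
    default:
      return "unknown";
  }
}

// Builds the URL and request options for a non-streaming chat completion.
function buildChatRequest(baseUrl: string, apiToken: string, model: string, userMessage: string) {
  return {
    url: `${baseUrl}/api/client/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json", // assumed; standard for JSON bodies
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userMessage }],
        stream: false,
      }),
    },
  };
}
```

The returned `url` and `init` can be passed to `fetch(url, init)` in environments that provide it, and the resulting `response.status` fed to `classifyStatus`.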