
Crawler API

Save reusable web crawler profiles, manage their seed URLs, trigger runs, and read the resulting pages as Markdown. Requests are authenticated with an API token.

All endpoints live under /api/client/v1/crawler/* and require a Bearer token:

```http
Authorization: Bearer cpeer_…
```
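As a rough illustration of the header above, the following Python sketch builds an authenticated request with the standard library only. The helper names `build_request` and `send` are hypothetical, and the host is taken from the curl example at the end of this page, so adjust it for your deployment.

```python
import json
import urllib.request

# Host copied from the curl example on this page (an assumption for other deployments).
BASE_URL = "https://console.cognipeer.com/api/client/v1/crawler"


def build_request(method, path, token, body=None):
    """Build (but do not send) an authenticated Crawler API request."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send the request and decode the JSON body; empty bodies (e.g. 204) give None."""
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        return json.loads(raw) if raw else None
```

Keeping request construction separate from sending makes the header logic easy to check without network access, for example `send(build_request("GET", "/crawlers", token))`.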

A crawler is a saved container that holds crawl configuration (engine, depth/page limits, scope filters, HTTP options, RAG binding, webhook, schedule). Running a crawler, or starting an ad-hoc run, enqueues a job. Each fetched page or file is stored as a result; HTML pages are extracted to Markdown.

By default the client API runs async: a run enqueues the job and returns immediately with { "jobId": "…", "status": "queued" }. Poll the job, or supply a callbackUrl/webhook to be notified. Pass "mode": "sync" to block until the crawl finishes.

Crawler config fields

These appear on create/update and inside a job's frozen planSnapshot:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| name | string | | Display name (required on create). |
| key | string | from name | URL-friendly id, unique per tenant/project. ^[a-z0-9][a-z0-9_-]*$. |
| description | string | | Up to 2000 chars. |
| seeds | string[] (url) | [] | Initial URL list. Up to 500. URLs can also be managed via /urls. |
| engine | axios \| playwright \| auto | auto | Fetch engine; playwright renders JS. |
| maxDepth | int 0–3 | 0 | Link-follow depth. 0 = only the given URLs. |
| maxPages | int 0–5000 | 50 | Page cap. 0 = unlimited. |
| autoCrawl | boolean | false | Follow discovered links within scope. |
| scope | object | | sameDomainOnly (default true), includeSubdomains (default false), allowList[], blockList[] host globs. |
| http | object | | userAgent, acceptLanguage, timeoutMs (1000–120000), maxConcurrency (1–16), retries (1–5), headers, cookies[], basicAuth, bearerToken, allowPrivateNetwork. |
| downloadableMimes | string[] | | MIME types treated as downloadable files. |
| markdownOptions | object | | { ocr: { enabled, languages? } } forwarded to the Markdown extractor. |
| rag | object | | { ragModuleKey, enabled }. Ingests crawled pages into a RAG module. |
| webhook | object | | { url, secret?, events[] } where events are page, completed, failed. |
| schedule | object | | { mode: interval \| cron, enabled, intervalSeconds?, cron?, startAt?, endAt? }. Interval mode needs intervalSeconds (≥60); cron mode needs cron. |
| metadata | object | | Arbitrary key/value bag. |
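Several of the constraints in the table can be checked on the client before sending a create request. The sketch below is one possible pre-flight check, not the server's validation logic; `validate_crawler_config` is a hypothetical helper and it covers only a handful of the documented limits.

```python
import re

# Same pattern as documented for key; fullmatch() anchors both ends.
KEY_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]*")


def validate_crawler_config(config):
    """Return a list of problems; an empty list means none of these checks failed."""
    problems = []
    if not config.get("name"):
        problems.append("name is required on create")
    key = config.get("key")
    if key is not None and not KEY_PATTERN.fullmatch(key):
        problems.append("key must match ^[a-z0-9][a-z0-9_-]*$")
    if len(config.get("description", "")) > 2000:
        problems.append("description is limited to 2000 chars")
    if len(config.get("seeds", [])) > 500:
        problems.append("seeds is limited to 500 URLs")
    if not 0 <= config.get("maxPages", 50) <= 5000:
        problems.append("maxPages must be between 0 and 5000")
    schedule = config.get("schedule")
    if schedule:
        mode = schedule.get("mode")
        if mode == "interval" and schedule.get("intervalSeconds", 0) < 60:
            problems.append("interval schedules need intervalSeconds >= 60")
        elif mode == "cron" and not schedule.get("cron"):
            problems.append("cron schedules need a cron expression")
    return problems
```

The server remains the source of truth; a check like this only avoids a round trip for obvious mistakes.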

Crawlers

List

```http
GET /api/client/v1/crawler/crawlers?status=active&search=docs
```

| Query | Type | Notes |
| --- | --- | --- |
| status | active \| disabled | Filter by status. |
| search | string | Match on name/key. |

Response

```json
{
  "crawlers": [
    {
      "id": "665f…",
      "key": "docs-site",
      "name": "Docs site",
      "status": "active",
      "engine": "auto",
      "maxDepth": 1,
      "maxPages": 200,
      "autoCrawl": true,
      "seeds": ["https://example.com/docs"],
      "scope": { "sameDomainOnly": true, "includeSubdomains": false },
      "createdAt": "2026-06-15T10:00:00.000Z"
    }
  ]
}
```

Create

```http
POST /api/client/v1/crawler/crawlers
```

```json
{
  "name": "Docs site",
  "key": "docs-site",
  "seeds": ["https://example.com/docs"],
  "engine": "auto",
  "maxDepth": 1,
  "maxPages": 200,
  "autoCrawl": true,
  "scope": { "sameDomainOnly": true },
  "rag": { "ragModuleKey": "support-kb", "enabled": true },
  "webhook": { "url": "https://app.example.com/hooks/crawl", "events": ["completed", "failed"] }
}
```

Returns 201 with { "crawler": { … } }.

Get

```http
GET /api/client/v1/crawler/crawlers/:idOrKey
```

Accepts either the crawler id or its key. Returns { "crawler": { … } }, or 404 if not found.

Update

```http
PATCH /api/client/v1/crawler/crawlers/:idOrKey
```

Partial update. Accepts the same fields as create plus status (active | disabled). Set rag, webhook, or schedule to null to clear them. Returns { "crawler": { … } }.
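Clearing a binding means sending an explicit JSON null, which is easy to get wrong when a client drops empty values. A minimal sketch of building such a PATCH body, where `build_patch_body` is a hypothetical helper name:

```python
import json

# Fields the Update endpoint documents as clearable with null.
CLEARABLE = {"rag", "webhook", "schedule"}


def build_patch_body(changes, clear=()):
    """Merge field changes with explicit nulls for the bindings to clear."""
    unknown = set(clear) - CLEARABLE
    if unknown:
        raise ValueError(f"cannot clear: {sorted(unknown)}")
    body = dict(changes)
    for field in clear:
        body[field] = None  # json.dumps serializes None as null
    return json.dumps(body)
```

For example, `build_patch_body({"status": "disabled"}, clear=["schedule"])` disables the crawler and removes its schedule in one request.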

Delete

```http
DELETE /api/client/v1/crawler/crawlers/:idOrKey
```

Returns 204 on success, 404 if not found.

URLs

List, add, or remove the crawler's saved URL list. A crawler is a container, so URLs can be managed independently of runs.

```http
GET    /api/client/v1/crawler/crawlers/:idOrKey/urls
POST   /api/client/v1/crawler/crawlers/:idOrKey/urls
DELETE /api/client/v1/crawler/crawlers/:idOrKey/urls
```

Body for POST / DELETE:

```json
{ "urls": ["https://example.com/docs", "https://example.com/blog"] }
```

| Field | Type | Notes |
| --- | --- | --- |
| urls | string[] (url) | 1–500 URLs to add or remove. |

All three return the updated list: { "urls": ["…"] }.
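Because each POST or DELETE accepts at most 500 URLs, a longer list has to be split across several requests. One way to prepare the bodies is sketched below; `chunk_urls` is a hypothetical helper, and the de-duplication step is a client-side convenience rather than documented API behavior.

```python
def chunk_urls(urls, size=500):
    """Split a URL list into request bodies of at most `size` URLs each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    return [{"urls": unique[i:i + size]} for i in range(0, len(unique), size)]
```

Each returned dict can be sent as the body of one POST or DELETE to the /urls endpoint.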

Run

Run a saved crawler. Enqueues a job using the crawler's config.

```http
POST /api/client/v1/crawler/crawlers/:idOrKey/run
```

```json
{
  "urls": ["https://example.com/changelog"],
  "callbackUrl": "https://app.example.com/hooks/crawl",
  "mode": "async",
  "metadata": { "source": "ci" }
}
```

| Field | Type | Notes |
| --- | --- | --- |
| urls | string[] (url) | Optional. Overrides the saved URL list for this run (max 500). |
| seeds | string[] (url) | Legacy alias for urls. |
| callbackUrl | string (url) | Per-run webhook receiver. |
| mode | sync \| async | Defaults to async. |
| metadata | object | Stored on the job. |

Response (202)

```json
{ "jobId": "6660…", "status": "queued" }
```

Crawl

Crawl an explicit set of URLs using a saved crawler's config ("give me the Markdown for these URLs"). Functionally, this is a run in which urls is required.

```http
POST /api/client/v1/crawler/crawlers/:idOrKey/crawl
```

```json
{
  "urls": ["https://example.com/page-1", "https://example.com/page-2"],
  "mode": "sync"
}
```

| Field | Type | Notes |
| --- | --- | --- |
| urls | string[] (url) | Required, 1–500 URLs. |
| callbackUrl | string (url) | Per-run webhook receiver. |
| mode | sync \| async | Defaults to async. |
| metadata | object | Stored on the job. |

In async mode (the default), returns 202 with { "jobId": "…", "status": "queued" }.

Ad-hoc run

Start a one-off crawl without saving a crawler.

```http
POST /api/client/v1/crawler/run
```

```json
{
  "seeds": ["https://example.com/article"],
  "engine": "auto",
  "maxDepth": 0,
  "maxPages": 20,
  "autoCrawl": false,
  "callbackUrl": "https://app.example.com/hooks/crawl",
  "mode": "async"
}
```

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| seeds | string[] (url) | | Required, 1–50 URLs. |
| engine | axios \| playwright \| auto | auto | |
| maxDepth | int 0–3 | 0 | |
| maxPages | int 0–5000 | 20 | |
| autoCrawl | boolean | false | |
| scope / http / downloadableMimes / markdownOptions / rag / webhook / metadata | | | Same shapes as crawler config. |
| callbackUrl | string (url) | | Per-run webhook receiver. |
| mode | sync \| async | async | |

Returns 202 with { "jobId": "…", "status": "queued" }. The job has no crawlerKey.
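For illustration, an ad-hoc run body could be assembled as in the sketch below. `build_adhoc_run` is a hypothetical helper; its defaults mirror the example payload above, and only the seed count, engine, and mode are checked.

```python
def build_adhoc_run(seeds, engine="auto", max_depth=0, max_pages=20,
                    auto_crawl=False, callback_url=None, mode="async"):
    """Build the JSON body for POST /api/client/v1/crawler/run."""
    if not 1 <= len(seeds) <= 50:
        raise ValueError("ad-hoc runs accept 1-50 seed URLs")
    if engine not in ("axios", "playwright", "auto"):
        raise ValueError(f"unknown engine: {engine}")
    if mode not in ("sync", "async"):
        raise ValueError(f"unknown mode: {mode}")
    body = {
        "seeds": list(seeds),
        "engine": engine,
        "maxDepth": max_depth,
        "maxPages": max_pages,
        "autoCrawl": auto_crawl,
        "mode": mode,
    }
    if callback_url is not None:
        body["callbackUrl"] = callback_url  # omitted entirely when not set
    return body
```

Note the tighter seed limit here (50) compared with saved crawlers (500).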

Jobs & Results

A job moves through these statuses:

| Status | Meaning |
| --- | --- |
| queued | Enqueued, not yet started. |
| running | Currently crawling. |
| succeeded | Finished, no fatal errors. |
| partial | Finished but some pages errored. |
| failed | No pages processed; the run failed. |
| canceled | Stopped via the cancel endpoint. |
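When polling instead of using a webhook, a client can stop as soon as the job leaves queued or running. The sketch below assumes a caller-supplied `fetch_job(job_id)` callable (hypothetical) that returns the job object from the Get job endpoint; the interval and timeout values are arbitrary choices.

```python
import time

# Statuses from the table above after which a job no longer changes.
TERMINAL_STATUSES = {"succeeded", "partial", "failed", "canceled"}


def wait_for_job(fetch_job, job_id, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return the job object."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP library and makes it testable without real delays.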

List jobs

```http
GET /api/client/v1/crawler/jobs?crawlerKey=docs-site&status=succeeded&limit=20
```

| Query | Type | Notes |
| --- | --- | --- |
| crawlerKey | string | Filter by parent crawler. |
| status | job status | Filter by status. |
| limit | number | Max jobs to return. |

Response

```json
{
  "jobs": [
    {
      "id": "6660…",
      "crawlerKey": "docs-site",
      "trigger": "api",
      "status": "succeeded",
      "pagesDiscovered": 42,
      "pagesProcessed": 40,
      "filesProcessed": 2,
      "errorsCount": 0,
      "limitReached": false,
      "startedAt": "2026-06-15T10:01:00.000Z",
      "endedAt": "2026-06-15T10:02:30.000Z",
      "durationMs": 90000
    }
  ]
}
```

trigger is one of manual, api, adhoc, schedule.

Get job

```http
GET /api/client/v1/crawler/jobs/:jobId
```

Returns { "job": { … } } (including counters, planSnapshot, errorMessage), or 404.

List results

```http
GET /api/client/v1/crawler/jobs/:jobId/results?type=html&limit=100&skip=0
```

| Query | Type | Default | Notes |
| --- | --- | --- | --- |
| type | html \| file \| error | | Filter by result type. |
| limit | number | 100 | Page size. |
| skip | number | 0 | Offset. |

Response

```json
{
  "results": [
    {
      "id": "6661…",
      "jobId": "6660…",
      "url": "https://example.com/docs/intro",
      "parentUrl": "https://example.com/docs",
      "depth": 1,
      "type": "html",
      "httpStatus": 200,
      "contentType": "text/html",
      "title": "Introduction",
      "bodyMarkdown": "# Introduction\n\n…",
      "bytes": 4096,
      "ragDocumentId": "doc_abc",
      "ragStatus": "indexed",
      "fetchedAt": "2026-06-15T10:01:05.000Z"
    }
  ]
}
```

bodyMarkdown is present for html results. ragStatus is one of pending, indexed, skipped, failed (only when the crawler has a RAG binding). error results carry an errorMessage.
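Reading all results of a large job means walking the limit/skip pages. The generator below is a sketch under two assumptions: `fetch_page(job_id, params)` is a hypothetical callable that performs the GET and returns the decoded response, and a page shorter than the requested limit marks the end of the list (the response shown above carries no total count).

```python
def iter_results(fetch_page, job_id, result_type=None, page_size=100):
    """Yield every result of a job, requesting one limit/skip page at a time."""
    skip = 0
    while True:
        params = {"limit": page_size, "skip": skip}
        if result_type is not None:
            params["type"] = result_type
        page = fetch_page(job_id, params)["results"]
        yield from page
        if len(page) < page_size:
            return  # assumed end of list: the last page came back short
        skip += page_size
```

For example, `[r["bodyMarkdown"] for r in iter_results(fetch_page, job_id, result_type="html")]` collects the Markdown of every crawled HTML page.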

Get result

```http
GET /api/client/v1/crawler/jobs/:jobId/results/:resultId
```

Returns { "result": { … } }, or 404.

Cancel job

```http
POST /api/client/v1/crawler/jobs/:jobId/cancel
```

Requests cancellation of a queued/running job. Returns { "ok": true }, or 404 if the job is missing or not cancelable.

Errors

| Status | Cause |
| --- | --- |
| 400 | Invalid body, duplicate key (already exists), crawler not active. |
| 401 | Missing/invalid API token. |
| 404 | Crawler, job, or result not found; job not cancelable. |
| 500 | Internal error. |
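A client can turn these status codes into a single exception type with a readable hint. The names `CrawlerApiError` and `raise_for_status` below are hypothetical, and the error response body is not documented here, so this sketch keys on the status code alone.

```python
# Hints paraphrased from the status table above.
ERROR_HINTS = {
    400: "invalid body, duplicate key, or crawler not active",
    401: "missing or invalid API token",
    404: "crawler, job, or result not found, or job not cancelable",
    500: "internal error",
}


class CrawlerApiError(Exception):
    """Raised for any non-2xx response from the Crawler API."""

    def __init__(self, status, hint):
        super().__init__(f"HTTP {status}: {hint}")
        self.status = status
        self.hint = hint


def raise_for_status(status):
    """Do nothing for 2xx; otherwise raise CrawlerApiError with a hint."""
    if 200 <= status < 300:
        return
    raise CrawlerApiError(status, ERROR_HINTS.get(status, "undocumented status"))
```

Since 404 covers both a missing job and a job that is not cancelable, callers of the cancel endpoint may want to fetch the job afterwards to tell the two cases apart.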

Example

```bash
# 1. Create a crawler
curl -X POST https://console.cognipeer.com/api/client/v1/crawler/crawlers \
  -H "Authorization: Bearer cpeer_…" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Docs site",
        "key": "docs-site",
        "seeds": ["https://example.com/docs"],
        "maxDepth": 1,
        "maxPages": 200,
        "autoCrawl": true
      }'

# 2. Run it (async)
curl -X POST https://console.cognipeer.com/api/client/v1/crawler/crawlers/docs-site/run \
  -H "Authorization: Bearer cpeer_…" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "async" }'
# → { "jobId": "6660…", "status": "queued" }

# 3. Poll the job
curl https://console.cognipeer.com/api/client/v1/crawler/jobs/6660… \
  -H "Authorization: Bearer cpeer_…"

# 4. Read the crawled pages as Markdown
curl "https://console.cognipeer.com/api/client/v1/crawler/jobs/6660…/results?type=html" \
  -H "Authorization: Bearer cpeer_…"
```

Community edition is AGPL-3.0. Commercial licensing and support are available separately.