Vision

This example shows the message shape for multimodal input with an image URL and a text prompt.

Use this when

your model supports image input
you want to understand multimodal message format before putting it behind a full agent
you need the smallest possible vision example

What it shows

image content in the message payload
a model adapter capable of multimodal reasoning
final output returned through the normal invoke flow

Run it

cd examples
npm run example:vision

Core code

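// A single user message whose content mixes a text block and an image block.
const message = {
	role: "user",
	content: [
		{ type: "text", text: "What does this image contain?" },
		{
			type: "image_url",
			image_url: "https://fastly.picsum.photos/id/237/200/300.jpg?hmac=TmmQSbShHz9CdQm0NkEjx1Dyh_Y984R9LpNrpvH2D_U",
		},
	],
} as const;

// `model` is the adapted multimodal model; this is the same invoke call used for text-only input.
const response = await model.invoke([message as any]);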

End-to-end flow

A multimodal-capable model is adapted.
The user message combines text and an image_url block.
The model receives the structured content array.
The response is returned through the same adapter contract as text-only calls.
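
The four steps above can be sketched without a provider call. The snippet below is a minimal illustration, not the framework's API: the `ContentPart`, `UserMessage`, and `ModelAdapter` types, the `buildVisionMessage` and `countImages` helpers, the string return type of `invoke`, and the placeholder image URL are all assumptions made for this sketch. A stub adapter stands in for the real model so the flow can be followed offline.

```typescript
// Hypothetical types that mirror the message shape used in this example.
// They are illustrative only and are not taken from the framework.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: string };
type ContentPart = TextPart | ImagePart;
type UserMessage = { role: "user"; content: ContentPart[] };

// Assumed adapter contract: one invoke method for text-only and multimodal input.
// The real adapter may return a richer response object than a plain string.
interface ModelAdapter {
	invoke(messages: UserMessage[]): Promise<string>;
}

// Step 2: combine a text prompt and an image URL in one user message.
function buildVisionMessage(prompt: string, imageUrl: string): UserMessage {
	return {
		role: "user",
		content: [
			{ type: "text", text: prompt },
			{ type: "image_url", image_url: imageUrl },
		],
	};
}

// Count the image blocks in a message, e.g. to log what the model will receive.
function countImages(message: UserMessage): number {
	return message.content.filter((part) => part.type === "image_url").length;
}

// Step 1 stand-in: a stub adapter that reports what it received
// instead of calling a provider.
const stubModel: ModelAdapter = {
	async invoke(messages) {
		const images = messages.reduce((n, msg) => n + countImages(msg), 0);
		return `received ${messages.length} message(s) with ${images} image(s)`;
	},
};

// Steps 3 and 4: the structured content array goes through the same invoke call.
async function main(): Promise<void> {
	const message = buildVisionMessage(
		"What does this image contain?",
		"https://example.com/dog.jpg", // placeholder URL, not a real image
	);
	console.log(await stubModel.invoke([message]));
}

main().catch(console.error);
```

Typing the content parts as a discriminated union on `type` is one possible way to keep text and image blocks in the same array while still letting the compiler check each block's fields.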

Why it matters

Vision support is not a separate framework path. It is the same agent runtime with a model that can understand image-bearing messages.

How it works

The example talks directly to the adapted model instead of a full agent because the focus is multimodal message shape. Once that shape works, you can use the same content structure inside normal agent invocations.

Production takeaway

Get the message shape right first. After that, vision can be layered into normal agent workflows without inventing a new runtime model.

Expected output

the model returns a short description of the image content
the response arrives through the same adapter contract as text-only usage

Common failure modes

OPENAI_API_KEY is missing, so the script exits immediately
the selected provider model does not support image inputs
the image URL is inaccessible or blocked, leading to a degraded response
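
Two of these failures can be caught before the model is called. The helper below is a rough sketch under stated assumptions: `preflight` and `FetchLike` are hypothetical names that are not part of the example script, and the fetch function is injected so the check can be exercised without network access. It only verifies that the local machine can reach the image, which does not guarantee the provider can, and it cannot tell whether the selected model accepts image input.

```typescript
// Hypothetical preflight helper; not part of the example script.
// FetchLike is a narrowed view of fetch so a fake can be injected in tests.
type FetchLike = (
	url: string,
	init?: { method?: string },
) => Promise<{ ok: boolean; status: number }>;

// Return a list of human-readable problems; an empty list means no issue was found.
async function preflight(
	env: Record<string, string | undefined>,
	imageUrl: string,
	fetchFn: FetchLike,
): Promise<string[]> {
	const problems: string[] = [];

	// Failure mode 1: the API key is missing or empty.
	if (!env.OPENAI_API_KEY) {
		problems.push("OPENAI_API_KEY is not set");
	}

	// Failure mode 3: the image URL is inaccessible or blocked.
	try {
		// HEAD avoids downloading the image body. Some hosts reject HEAD,
		// so a failure here is a hint rather than proof.
		const res = await fetchFn(imageUrl, { method: "HEAD" });
		if (!res.ok) {
			problems.push(`image URL returned HTTP ${res.status}`);
		}
	} catch (err) {
		problems.push(`image URL is unreachable: ${String(err)}`);
	}

	return problems;
}
```

In a Node 18+ script this could be called as `await preflight(process.env, imageUrl, fetch)`, logging the returned problems and exiting before `model.invoke` when the list is not empty.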
