The `withCachedContext()` method lets you separate the stable parts of a conversation from the per-call messages, so the provider can cache and reuse them across requests.
## How Context Caching Works
Without caching, every request includes the full conversation history, system prompt, and any reference material. As conversations grow, this can lead to significant token overhead. Context caching solves this by marking a portion of the request as a reusable prefix. The provider stores this prefix server-side and references it on subsequent requests, reducing both the number of tokens processed and the time to first token.

## Using Cached Context
The `withCachedContext()` method accepts the same kinds of data you would normally pass through `with()`, but treats them as a persistent prefix for subsequent requests:
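A sketch of what this can look like, assuming Polyglot's fluent `Inference` API; the namespace, the `using()`/`with()`/`get()` method names, and the `$referenceMaterial` variable are illustrative and may differ in your version:

```php
use Cognesy\Polyglot\Inference\Inference;

// Stable prefix: system prompt plus large reference content,
// cached and reused across requests.
$inference = (new Inference())
    ->using('anthropic')
    ->withCachedContext(
        messages: [
            ['role' => 'system', 'content' => 'You are a helpful assistant.'],
            ['role' => 'user', 'content' => $referenceMaterial], // large, stable content
        ],
    );

// Per-call messages ride on top of the cached prefix.
$answer = $inference
    ->with(messages: 'Summarize the key points.')
    ->get();
```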
The first request writes the cached prefix to the provider's cache (a cache miss, with `cacheWriteTokens` reported). Every subsequent request that shares the same prefix benefits from a cache hit, reflected in `cacheReadTokens`.
## What Can Be Cached
The cached context can include any combination of:

- `messages` — system prompts, conversation history, or reference material
- `tools` — tool/function definitions that remain constant across calls
- `toolChoice` — the tool selection strategy
- `responseFormat` — a fixed response schema
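For instance, a prefix that pins both a system prompt and a fixed tool set might look like this. This is a sketch: the tool schema is illustrative, and the exact parameter shapes accepted by `withCachedContext()` may differ in your version:

```php
$inference = (new Inference())
    ->using('anthropic')
    ->withCachedContext(
        messages: [
            ['role' => 'system', 'content' => 'Answer using the provided tools.'],
        ],
        // Tool definitions stay constant across calls, so they cache well.
        tools: [[
            'type' => 'function',
            'function' => [
                'name' => 'lookup_order',
                'description' => 'Look up an order by its ID.',
                'parameters' => [
                    'type' => 'object',
                    'properties' => ['order_id' => ['type' => 'string']],
                    'required' => ['order_id'],
                ],
            ],
        ]],
        toolChoice: 'auto',
    );
```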
## Processing Large Documents
Context caching is particularly valuable when working with large documents. Instead of resending the full document with every question, you cache it once and issue lightweight follow-up queries.

## Inspecting Cache Usage
If a provider reports cache usage, you can inspect it through `response()->usage()`. The `InferenceUsage` object exposes the following cache-related fields when available:
| Field | Description |
|---|---|
| `cacheReadTokens` | Tokens served from the cache (cache hit) |
| `cacheWriteTokens` | Tokens written to the cache (cache miss / first request) |
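A sketch of reading these fields after a call, assuming the fluent API shown earlier; whether `cacheReadTokens` and `cacheWriteTokens` are public properties or accessor methods may differ in your version:

```php
$response = $inference
    ->with(messages: 'What does section 3 say?')
    ->response();

$usage = $response->usage();

// On the first request expect a cache write; on later requests, cache reads.
echo "cache writes: {$usage->cacheWriteTokens}\n";
echo "cache reads:  {$usage->cacheReadTokens}\n";
```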
## Provider Support
Different providers handle context caching differently:

| Provider | Caching Behavior | Cache Metrics |
|---|---|---|
| Anthropic | Explicit cache markers with native support | Full reporting (`cacheReadTokens`, `cacheWriteTokens`) |
| OpenAI | Automatic server-side prompt caching | Limited reporting; no opt-in required |
| Other providers | No native caching | Polyglot manages conversation state correctly; no cache metrics |
For providers without native caching, `withCachedContext()` still works correctly — the context is prepended to each request — but you will not see cache-related usage metrics in the response.
**Tip:** To maximize cache hit rates with Anthropic, keep your cached context stable across requests. Even small changes to the cached portion will invalidate the cache and trigger a new cache write.