## Inference Lifecycle

### 1. Request Construction

The lifecycle begins when the application builds an `InferenceRequest` through the `Inference` facade. The facade returns an `InferenceRequestBuilder` that accumulates parameters. Every `with*()` call returns a new immutable copy, so the original instance is never modified.
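The copy-on-write behavior can be sketched as follows; the entry point and the specific `with*()` method names here are illustrative assumptions, not the library's confirmed API:

```php
// Illustrative sketch: each with*() call returns a new builder copy.
// Method names below are assumptions for illustration only.
$base = Inference::request()
    ->withModel('gpt-4o-mini')
    ->withMessages([['role' => 'user', 'content' => 'Hello']]);

// Deriving a variant leaves $base untouched:
$variant = $base->withMaxTokens(64);
```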
### 2. Creating a Pending Handle

Calling `create()` (or a shortcut like `get()` or `response()`) builds the `InferenceRequest` and passes it to the runtime. The `InferenceRuntime` wraps the request in an `InferenceExecution` object and returns a `PendingInference` handle. Execution is still deferred; no HTTP call has been sent yet.

The `InferenceExecution` tracks the full lifecycle state: the original request, retry attempts, usage accumulation, and the final response.
### 3. Triggering Execution

The HTTP call is triggered only when you read from the `PendingInference`. The handle delegates to `InferenceExecutionSession`, which orchestrates the full lifecycle.
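A minimal sketch of the deferred execution, assuming a hypothetical facade entry point (`create()`, `response()`, and `PendingInference` come from the text above):

```php
// No HTTP traffic happens here; create() only wraps the request
// in an InferenceExecution and hands back a PendingInference.
$pending = Inference::request()          // hypothetical entry point
    ->withMessages($messages)
    ->create();

// The first read triggers InferenceExecutionSession and the HTTP call:
$response = $pending->response();
```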
### 4. The Execution Session

The `InferenceExecutionSession` is the heart of the lifecycle. It performs these steps for a non-streaming request:
- Dispatches `InferenceStarted`: signals the beginning of the operation, including the execution ID, request details, and whether streaming is enabled
- Dispatches `InferenceAttemptStarted`: signals the beginning of an attempt with the attempt number and model
- Calls the driver: `driver->makeResponseFor($request)` triggers the full request-response cycle:
    - The driver's request adapter converts `InferenceRequest` into an `HttpRequest`
    - The HTTP client sends the request to the provider
    - The driver's response adapter normalizes the raw `HttpResponse` into an `InferenceResponse`
- Checks the response: if the finish reason indicates a failure (error, content filter, or length limit), the session handles it according to the retry policy
- Dispatches success events:
    - `InferenceResponseCreated`: the response is ready
    - `InferenceAttemptSucceeded`: the attempt completed, including finish reason and usage
    - `InferenceUsageReported`: token usage (`InferenceUsage`) is reported with the model name
    - `InferenceCompleted`: the entire operation is done, including total attempt count and timing
- Returns `InferenceResponse` to the caller
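The driver round trip in the middle of the session can be pictured like this; the adapter interface and method names are assumptions used only to show the three conversion steps:

```php
// Conceptual sketch of driver->makeResponseFor($request).
// RequestAdapter/ResponseAdapter names are illustrative, not the real interfaces.
final class ExampleDriver
{
    public function __construct(
        private RequestAdapter  $requestAdapter,
        private HttpClient      $httpClient,
        private ResponseAdapter $responseAdapter,
    ) {}

    public function makeResponseFor(InferenceRequest $request): InferenceResponse
    {
        $httpRequest  = $this->requestAdapter->toHttpRequest($request);   // InferenceRequest -> HttpRequest
        $httpResponse = $this->httpClient->send($httpRequest);            // call the provider
        return $this->responseAdapter->fromHttpResponse($httpResponse);   // HttpResponse -> InferenceResponse
    }
}
```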
Note: response cost is calculated on demand by the `FlatRateCostCalculator`, using `InferencePricing` data from the `LLMConfig`, rather than being attached to the usage object in the pipeline.
### 5. Retry Handling

If the request fails with a retryable error (transient HTTP status, timeout, network error, or provider-classified retriable exception), the session:

- Records the failure on the execution object
- Dispatches `InferenceAttemptFailed`: carries the error details, HTTP status code, partial usage, and `willRetry: true`
- Waits for the configured delay (exponential backoff with optional jitter)
- Dispatches a new `InferenceAttemptStarted` and retries
Once all attempts are exhausted, the session dispatches `InferenceCompleted` with `isSuccess: false` and throws the terminal error.
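The backoff step between attempts follows the familiar exponential pattern. A self-contained sketch (parameter names are illustrative; the actual retry policy defines its own configuration):

```php
// Exponential backoff with optional "full jitter":
// attempt 1 -> base, attempt 2 -> 2x base, ... capped at $maxMs.
function retryDelayMs(int $attempt, int $baseMs = 200, int $maxMs = 10_000, bool $jitter = true): int
{
    $delay = min($maxMs, $baseMs * (2 ** ($attempt - 1)));
    if ($jitter) {
        $delay = random_int(0, $delay);   // randomize to avoid thundering herds
    }
    return $delay;
}
```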
Length-limit recovery has special handling. When a response finishes with `Length` as the finish reason and the retry policy allows length recovery, the session can:

- `'continue'`: append the partial response as an assistant message, add a continuation prompt, and retry
- `'increase_max_tokens'`: increase the `max_tokens` option by the configured increment and retry

Length-recovery attempts are capped separately by `lengthMaxAttempts`.
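To make the two modes concrete, a hypothetical policy fragment (all key names here are assumptions; consult the actual retry policy class for real option names):

```php
// Hypothetical configuration sketch illustrating the two recovery modes.
$lengthRecovery = [
    'mode'               => 'continue',   // or 'increase_max_tokens'
    'lengthMaxAttempts'  => 2,            // separate cap for length recovery
    'maxTokensIncrement' => 512,          // only used by 'increase_max_tokens'
];
```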
### 6. Cached Context

If the request includes a `CachedInferenceContext`, the driver applies it before sending. Cached context allows you to pre-configure messages, tools, tool choice, and response format that are prepended to, or merged with, the request's own values. This is particularly useful for system prompts or shared tool definitions that remain constant across calls.
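A sketch of attaching cached context; the constructor shape and the `withCachedContext()` builder method are assumptions, shown only to illustrate where the shared values come from:

```php
// Hypothetical sketch: shared system prompt and tools reused across calls.
$cached = new CachedInferenceContext(
    messages: [['role' => 'system', 'content' => 'You are a terse assistant.']],
    tools:    $sharedTools,
);

$response = Inference::request()            // hypothetical entry point
    ->withCachedContext($cached)            // applied by the driver before sending
    ->withMessages([['role' => 'user', 'content' => 'Summarize this.']])
    ->get();
```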
## Streaming Lifecycle

When streaming is enabled, the flow diverges after the HTTP request is sent:

- `PendingInference::stream()` validates that streaming was requested, then creates an `InferenceStream`
- The driver produces an iterable of `PartialInferenceDelta` objects from the SSE event stream via `driver->makeStreamDeltasFor($request)`
- The `InferenceStream` tracks visibility state through a `VisibilityTracker` and yields only deltas with meaningful changes (filtering out empty or duplicate deltas)
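Consuming the stream might look like this; `stream()`, `deltas()`, and `final()` come from the text, while the delta's content accessor is an assumption:

```php
$stream = $pending->stream();            // validates streaming was requested

foreach ($stream->deltas() as $delta) {  // only visible, meaningful deltas
    echo $delta->contentDelta;           // hypothetical property name
}

$response = $stream->final();            // assembled InferenceResponse
```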
### Stream Events

The stream dispatches events as deltas arrive:

- `StreamFirstChunkReceived`: when the first visible delta arrives, including the request start time for TTFC measurement
- `PartialInferenceDeltaCreated`: for each visible delta
- `InferenceResponseCreated`: when the stream finishes and the final response is assembled from accumulated state
### Stream Processing

The stream supports functional-style processing through `map()`, `reduce()`, and `filter()`:
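For illustration (callback signatures and the delta's content accessor are assumptions; remember that each stream can be consumed only once, so pick one pipeline per stream):

```php
// Accumulate the full text from visible deltas via reduce():
$text = $pending->stream()->reduce(
    fn (string $acc, $delta) => $acc . $delta->contentDelta,  // hypothetical field
    '',
);

// Or transform deltas lazily with filter()/map() on a fresh stream:
$upper = $otherPending->stream()
    ->filter(fn ($delta) => $delta->contentDelta !== '')
    ->map(fn ($delta) => strtoupper($delta->contentDelta));
```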
### Delta Callback

You can register a callback that fires for every visible delta.

### Stream Finalization

Calling `final()` on a stream that has not been fully consumed will drain the remaining deltas first, ensuring the final response is complete. A stream can only be consumed once; calling `deltas()` a second time throws a `LogicException`.
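A sketch of early termination followed by finalization (the stopping condition and the delta's content accessor are illustrative):

```php
$stream = $pending->stream();

foreach ($stream->deltas() as $delta) {
    if (str_contains($delta->contentDelta, "\n")) {  // hypothetical field and condition
        break;                                       // stop reading early
    }
}

$response = $stream->final();  // drains remaining deltas, returns the full response
// $stream->deltas();          // second consumption would throw LogicException
```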
The final response assembled from the stream goes through the same event dispatch as a synchronous response.
## Embeddings Lifecycle

The embeddings lifecycle is simpler since streaming is not involved:

- `Embeddings` builds an `EmbeddingsRequest` from the configured inputs, model, and options
- `create()` returns `PendingEmbeddings`: a lazy handle that holds the request, driver, and event dispatcher
- `get()` triggers execution:
    - The driver's `handle()` method sends the HTTP request
    - The response body is decoded and passed to `driver->fromData()` to build an `EmbeddingsResponse`
    - `EmbeddingsResponseReceived` is dispatched
- `EmbeddingsResponse` is returned, containing vectors and usage
Retry handling lives in `PendingEmbeddings`, based on the `EmbeddingsRetryPolicy` attached to the request. The retry loop follows the same exponential backoff pattern as inference retries.
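End to end, the embeddings flow can be sketched as follows; the facade entry point and `with*()` names are assumptions, while `create()`, `get()`, and `PendingEmbeddings` come from the text:

```php
$pending = Embeddings::request()                    // hypothetical entry point
    ->withModel('text-embedding-3-small')
    ->withInputs(['first text', 'second text'])
    ->create();                                     // lazy PendingEmbeddings

$response = $pending->get();                        // triggers the HTTP request
$vectors  = $response->vectors();                   // hypothetical accessor
```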
## Response Caching

Both the inference and embeddings lifecycles support response caching. When a `ResponseCachePolicy` is set on the request, the `InferenceExecutionSession` caches the response after the first successful execution. Subsequent calls to `response()` or `get()` on the same `PendingInference` return the cached result without making another HTTP call.
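A caching sketch; the builder method and policy constructor are assumptions used only to show where the policy attaches:

```php
$pending = Inference::request()                            // hypothetical entry point
    ->withMessages($messages)
    ->withCachePolicy(new ResponseCachePolicy(ttl: 300))   // hypothetical signature
    ->create();

$first  = $pending->response();   // HTTP call; result cached on the execution
$second = $pending->response();   // served from cache, no second HTTP call
```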
For streams, calling `deltas()` a second time will still throw a `LogicException`; however, `final()` always returns the assembled response, which is stored in the execution object.