The Four Layers
Public Layer
This is what application code usually touches. Two facade classes provide a unified interface for all provider interactions:

- Inference — for chat completions and text generation
- Embeddings — for generating vector embeddings
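A minimal usage sketch of the Inference facade, assuming it can be instantiated directly; the model name and message array shape here are placeholders, not documented values:

```php
// Sketch only: model name and message shape are assumptions.
$answer = (new Inference())
    ->withModel('gpt-4o-mini')
    ->withMessages([['role' => 'user', 'content' => 'Summarize REST in one sentence.']])
    ->create()  // returns a PendingInference handle; no HTTP call yet
    ->get();    // executes the call and returns the generated text
```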
Runtime Layer
Runtimes assemble the moving parts needed for a provider call. They wire together the configuration, driver, HTTP client, and event dispatcher, and they own the execution lifecycle, including retry logic and response caching. The key classes are:

- InferenceRuntime — coordinates inference execution and creates PendingInference handles
- EmbeddingsRuntime — coordinates embeddings execution and creates PendingEmbeddings handles

HTTP clients are built through HttpClientBuilder. Runtimes also expose onEvent() and wiretap() methods for hooking into the event system.
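The event hooks might be used like this. Only the method names onEvent() and wiretap() come from the runtime API above; the callback signatures, the event class, and calling the hooks on the facade rather than the runtime are all assumptions:

```php
// Sketch only: callback signatures and the event class are assumptions.
$inference = (new Inference())
    // wiretap() observes every event the runtime dispatches
    ->wiretap(fn ($event) => error_log((string) $event))
    // onEvent() subscribes to one event type (HttpRequestSent is hypothetical)
    ->onEvent(HttpRequestSent::class, fn ($event) => error_log('request sent'));
```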
Request and Response Layer
Requests and responses are normalized into package data objects that are provider-agnostic:

- InferenceRequest — messages, model, tools, tool choice, response format, options, cached context, retry policy, response cache policy
- InferenceResponse — content, reasoning content, tool calls, usage, finish reason, raw HTTP response data
- PartialInferenceDelta — a single streaming event delta with content, reasoning content, tool call fragments, finish reason, and usage
- EmbeddingsRequest — input texts, model, options, retry policy
- EmbeddingsResponse — vectors and usage
All of these objects expose with*() mutators for building modified copies.
Driver Layer
Drivers translate Polyglot requests into provider-native HTTP payloads and normalize the results back. Each driver implements CanProcessInferenceRequest (for inference) or CanHandleVectorization (for embeddings) and is composed of smaller adapter responsibilities:
- Request adapters (CanTranslateInferenceRequest) — convert InferenceRequest into an HttpRequest
- Response adapters (CanTranslateInferenceResponse) — convert raw HttpResponse data into an InferenceResponse or a stream of PartialInferenceDelta objects
- Message formatters (CanMapMessages) — map typed Messages to provider-specific structures, composing a MessageMapper utility for iteration
- Body formatters (CanMapRequestBody) — assemble the full request body with mode-specific adjustments
- Usage formatters (CanMapUsage) — extract token usage from provider responses
Inference drivers extend BaseInferenceRequestDriver, which provides the standard HTTP execution flow and stream handling. Provider-specific classes like OpenAIDriver, AnthropicDriver, and GeminiDriver compose the appropriate adapters and formatters for their API.
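Putting the pieces together, a provider driver might be composed roughly like this. The class name and constructor shape are hypothetical; only the interface and base-class names come from the text above:

```php
// Hypothetical composition sketch; the constructor shape is an assumption.
final class AcmeDriver extends BaseInferenceRequestDriver
{
    public function __construct(
        private CanTranslateInferenceRequest $requestAdapter,   // InferenceRequest -> HttpRequest
        private CanTranslateInferenceResponse $responseAdapter, // HttpResponse -> InferenceResponse
        private CanMapMessages $messageFormatter,               // typed Messages -> provider format
        private CanMapRequestBody $bodyFormatter,               // full request body assembly
        private CanMapUsage $usageFormatter,                    // token usage extraction
    ) {}
}
```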
How the Layers Connect
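As a rough sketch of a single call crossing the layers — the method names come from the sections above, while instantiation details and the model name are assumptions:

```php
// Public layer: configure an immutable facade.
$pending = (new Inference())
    ->withModel('gpt-4o-mini')                               // placeholder model
    ->withMessages([['role' => 'user', 'content' => 'Hi!']])
    // Runtime layer: create() hands off to InferenceRuntime, which
    // returns a PendingInference handle without any HTTP traffic.
    ->create();

// Driver layer: reading from the handle makes the driver translate the
// InferenceRequest, execute the HTTP call, and normalize the result.
$response = $pending->response(); // InferenceResponse
echo $response->content();
```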
Key Design Decisions
Immutability. Both the public facades and the request/response objects are immutable. Calling withMessages() or withModel() always returns a new instance rather than modifying the original. This makes it safe to reuse a configured Inference or Embeddings instance across multiple concurrent calls.
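For example, a configured base instance can branch safely — a sketch, assuming direct instantiation and a placeholder model name:

```php
$base = (new Inference())->withModel('gpt-4o-mini'); // placeholder model

// Each with*() call returns a new instance; $base is never mutated,
// so both derived requests can be issued concurrently.
$summary = $base->withMessages([['role' => 'user', 'content' => 'Summarize this text: ...']]);
$haiku   = $base->withMessages([['role' => 'user', 'content' => 'Write a haiku about PHP.']]);
```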
Lazy execution. Calling create() on a facade returns a PendingInference or PendingEmbeddings handle without triggering the HTTP call. Execution is deferred until the application reads from the handle via get(), response(), or stream().
Driver registry. Inference drivers are resolved through InferenceDriverRegistry, which maps string names (like 'openai' or 'anthropic') to driver factory functions. Embeddings drivers use EmbeddingsDriverFactory with a similar pattern. Both support registering custom drivers at runtime.
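Registering a custom driver might look roughly like this. Only the registry class name and the string-to-factory mapping are documented above; the register() method name, the factory signature, and MyCloudDriver are all assumptions:

```php
// Hypothetical sketch: register() and the factory signature are assumptions.
$registry = new InferenceDriverRegistry();
$registry->register('mycloud', function ($config) {
    return new MyCloudDriver($config); // MyCloudDriver is hypothetical
});
```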
Provider-agnostic data. The InferenceResponse and EmbeddingsResponse objects present a uniform shape regardless of which provider produced them. Provider-specific details are accessible through responseData() when needed, but the primary accessors (content(), toolCalls(), usage(), etc.) work identically across all providers.
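In practice that means the same reads work against any provider's response — a sketch in which only the accessor names come from the text above:

```php
// $response is an InferenceResponse produced by any provider's driver.
echo $response->content();          // generated text
$calls = $response->toolCalls();    // normalized tool calls
$usage = $response->usage();        // normalized token usage
$raw   = $response->responseData(); // provider-specific details when needed
```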