Batch Inputs
The single most impactful optimization is batching. Instead of making one request per document, send multiple texts in a single call. This reduces HTTP overhead and is often cheaper per token. Each provider caps the number of inputs per request (`maxInputs` in the preset): for OpenAI this defaults to 2048; for Cohere it is 96. When processing large datasets, chunk your documents to stay within these limits.
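As a sketch of the pattern (illustrative TypeScript, not Polyglot's actual API; `embedBatch` stands in for whatever client call sends one batched request), batching plus chunking to respect `maxInputs` looks like:

```typescript
// Illustrative sketch: send many texts per request, and split datasets
// that exceed the provider's maxInputs into multiple requests.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// `embedBatch` is a stand-in for the real client call. Providers return
// vectors in input order, so concatenation preserves that order.
async function embedAll(
  texts: string[],
  maxInputs: number, // e.g. 2048 for OpenAI, 96 for Cohere
  embedBatch: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, maxInputs)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```

With `maxInputs` of 2048, a 10,000-document corpus becomes five requests instead of 10,000.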
Processing Large Datasets
When you have more documents than a single batch can handle, process them in chunks, embedding each chunk in turn and concatenating the results.

Retry Policies
Network failures and rate limits are inevitable in production. Polyglot provides an `EmbeddingsRetryPolicy` that implements exponential backoff with configurable jitter.
Retry Policy Parameters
| Parameter | Default | Description |
|---|---|---|
| `maxAttempts` | 1 | Total number of attempts (1 = no retries) |
| `baseDelayMs` | 250 | Base delay in milliseconds before the first retry |
| `maxDelayMs` | 8000 | Maximum delay cap in milliseconds |
| `jitter` | 'full' | Jitter strategy: 'none', 'full', or 'equal' |
| `retryOnStatus` | [408, 429, 500, 502, 503, 504] | HTTP status codes that trigger a retry |
| `retryOnExceptions` | [TimeoutException, NetworkException] | Exception classes that trigger a retry |
Each retry delay is computed as `baseDelayMs * 2^(attempt - 1)`, capped at `maxDelayMs`; the configured jitter is then applied:
- `none` — Exact calculated delay, no randomization.
- `full` — Random value between 0 and the calculated delay. Best for reducing thundering herd.
- `equal` — Half the calculated delay plus a random value up to half. A middle ground.
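The formula and jitter strategies above can be sketched as follows (illustrative TypeScript; this is not the library's internal implementation, just the computation it describes):

```typescript
type Jitter = "none" | "full" | "equal";

// Exponential backoff with a cap, then jitter, per the table above.
function retryDelayMs(
  attempt: number, // 1-based: attempt 1 is the first retry
  baseDelayMs = 250,
  maxDelayMs = 8000,
  jitter: Jitter = "full",
): number {
  // baseDelayMs * 2^(attempt - 1), capped at maxDelayMs
  const capped = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
  if (jitter === "full") return Math.random() * capped; // [0, capped)
  if (jitter === "equal") return capped / 2 + Math.random() * (capped / 2); // [capped/2, capped)
  return capped; // "none": exact calculated delay
}
```

With the defaults, successive retries wait roughly 250 ms, 500 ms, 1000 ms, and so on, never exceeding 8000 ms before jitter.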
Important: Set `maxAttempts` to at least 3 in production to handle transient failures gracefully. The default of 1 means no retries.
Caching Embeddings
Embedding the same text repeatedly is wasteful. For applications that frequently re-embed identical strings (such as search queries or template documents), a caching layer pays for itself quickly.

Tip: For persistent caching across requests, replace the in-memory store with Redis, Memcached, or a database-backed store.
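A minimal in-memory sketch of such a cache, assuming a generic async `embed` callback in place of the real client call (all names here are illustrative):

```typescript
// Embeddings are deterministic per (model, text), so that pair makes a
// safe cache key. Swap the Map for Redis/Memcached for persistence.
class EmbeddingCache {
  private store = new Map<string, number[]>();

  async get(
    model: string,
    text: string,
    embed: (text: string) => Promise<number[]>,
  ): Promise<number[]> {
    const key = `${model}\u0000${text}`; // null byte avoids key collisions
    const hit = this.store.get(key);
    if (hit) return hit; // cache hit: no API call
    const vector = await embed(text); // cache miss: call the provider once
    this.store.set(key, vector);
    return vector;
  }
}
```

Repeated queries for the same string then cost one API call total instead of one per lookup.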
Choosing the Right Model
Model selection has a direct impact on both cost and quality. Here are the key trade-offs:

| Factor | Smaller Models | Larger Models |
|---|---|---|
| Dimensions | Fewer (e.g., 256-1536) | More (e.g., 3072) |
| Speed | Faster response times | Slower response times |
| Cost | Lower per-token cost | Higher per-token cost |
| Quality | Good for general use | Better for nuanced similarity |
| Storage | Less memory per vector | More memory per vector |
Some models (notably OpenAI's text-embedding-3 models) support requesting a specific number of dimensions, letting you trade precision for storage efficiency.
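For example, OpenAI's embeddings API accepts a `dimensions` field in the request body (the request object and helper below are illustrative):

```typescript
// Requesting reduced dimensions from a text-embedding-3 model.
const request = {
  model: "text-embedding-3-large",
  input: ["What is vector search?"],
  dimensions: 1024, // reduced from the model's native 3072
};

// Rough storage impact: float32 vectors cost 4 bytes per dimension.
const bytesPerVector = (dims: number): number => dims * 4;
// 1024 dims → 4096 bytes per vector; 3072 dims → 12288 bytes.
```

At scale, storing 1024-dimension vectors instead of 3072 cuts vector storage (and similarity-computation work) by two thirds.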
Best Practices
- Batch whenever possible. A single request with 100 texts is faster and cheaper than 100 individual requests.
- Set retry policies in production. Rate limits (HTTP 429) and transient server errors are common. Configure at least 3 attempts with jitter to handle them gracefully.
- Cache aggressively. Embeddings for the same text and model are deterministic. Cache them to avoid redundant API calls and reduce latency.
- Monitor token usage. Use the `usage()` method on responses to track consumption and detect unexpected spikes.
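A hedged sketch of such monitoring; the `{ inputTokens }` shape returned by `usage()` is an assumption here, so adapt the field names to what your client actually returns:

```typescript
// Assumed response shape, for illustration only.
interface EmbeddingsResponse {
  usage(): { inputTokens: number };
}

// Accumulates token usage across calls so spikes stand out in logs
// or dashboards.
class UsageMonitor {
  private total = 0;

  record(response: EmbeddingsResponse): number {
    this.total += response.usage().inputTokens;
    return this.total; // running total after this call
  }
}
```

Feeding every response through `record()` gives a running total you can compare against an expected budget per batch job.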