## Symptoms
- HTTP status code 429 (Too Many Requests)
- Error messages containing “rate limit exceeded,” “too many requests,” or “quota exceeded”
- Requests that work in isolation but fail under load
## Use the Built-In Retry Policy
Polyglot can automatically retry failed requests with exponential backoff and jitter. Retries are opt-in and explicit; you must attach an `InferenceRetryPolicy` to the inference builder:
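A minimal TypeScript sketch of what attaching a policy might look like. The builder method `withRetryPolicy`, the client API, and the host language itself are assumptions; only the class name `InferenceRetryPolicy` and the parameter names come from this guide:

```typescript
// Sketch only: field names mirror the parameter table below; the actual
// Polyglot API may differ.
interface InferenceRetryPolicy {
  maxAttempts: number;      // total attempts; 1 means no retries
  baseDelayMs: number;      // base delay before the first retry
  maxDelayMs: number;       // cap on any single delay
  jitter: "none" | "full" | "equal";
  retryOnStatus: number[];  // HTTP statuses that trigger a retry
}

// Explicitly enable retries: up to 5 attempts with the documented defaults.
const policy: InferenceRetryPolicy = {
  maxAttempts: 5,
  baseDelayMs: 250,
  maxDelayMs: 8000,
  jitter: "full",
  retryOnStatus: [408, 429, 500, 502, 503, 504],
};

// Hypothetical builder usage -- method names are illustrative only:
// const response = await client.inference()
//   .withRetryPolicy(policy)
//   .prompt("Summarize this document")
//   .run();
```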
## Retry Policy Parameters
| Parameter | Default | Description |
|---|---|---|
| `maxAttempts` | `1` | Total number of attempts (`1` means no retries) |
| `baseDelayMs` | `250` | Base delay in milliseconds before the first retry |
| `maxDelayMs` | `8000` | Maximum delay cap in milliseconds |
| `jitter` | `'full'` | Jitter strategy: `none`, `full`, or `equal` |
| `retryOnStatus` | `[408, 429, 500, 502, 503, 504]` | HTTP status codes that trigger a retry |
| `retryOnExceptions` | `[TimeoutException, NetworkException]` | Exception classes that trigger a retry |
The delay before each retry is computed as `baseDelayMs * 2^(attempt - 1)`, capped at `maxDelayMs`. The jitter strategy adds randomness to avoid thundering-herd problems:
- `none`: no randomness; uses the exact computed delay
- `full`: a random delay between 0 and the computed delay
- `equal`: half the computed delay plus a random value up to half the computed delay
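The delay calculation can be sketched as a small pure function. This is a hedged illustration of the formula above, not Polyglot's actual implementation; the injectable `rng` parameter is added here only to make the jitter deterministic in tests:

```typescript
type Jitter = "none" | "full" | "equal";

// Delay before retry `attempt` (1-based): baseDelayMs * 2^(attempt - 1),
// capped at maxDelayMs, with the chosen jitter strategy applied.
function retryDelayMs(
  attempt: number,
  baseDelayMs = 250,
  maxDelayMs = 8000,
  jitter: Jitter = "full",
  rng: () => number = Math.random, // injectable for deterministic tests
): number {
  const computed = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
  if (jitter === "none") return computed;         // exact computed delay
  if (jitter === "full") return rng() * computed; // uniform in [0, computed)
  return computed / 2 + rng() * (computed / 2);   // "equal": half fixed, half random
}
```

With `jitter: 'none'` and the defaults, attempts wait 250 ms, 500 ms, 1000 ms, and so on, up to the 8000 ms cap.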
## Length Recovery
The retry policy also supports automatic recovery when a response is truncated due to token limits.

## Retry Policy for Embeddings
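A hedged sketch of what an embeddings retry policy might look like. The class name `EmbeddingsRetryPolicy` and its constructor shape are assumptions, not confirmed Polyglot API; the fields and defaults mirror the inference policy table above:

```typescript
// Hypothetical: a separate policy class for embeddings that exposes the
// same fields and defaults as the inference retry policy.
class EmbeddingsRetryPolicy {
  constructor(
    public maxAttempts = 1,     // 1 means no retries, as with inference
    public baseDelayMs = 250,
    public maxDelayMs = 8000,
    public jitter: "none" | "full" | "equal" = "full",
  ) {}
}

const embeddingsPolicy = new EmbeddingsRetryPolicy(4);

// Hypothetical builder usage -- method names are illustrative only:
// client.embeddings().withRetryPolicy(embeddingsPolicy).input(texts).run();
```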
Embeddings requests use a separate policy class with the same interface.

## Application-Level Throttling
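A hedged, framework-agnostic sketch of client-side throttling using a rolling window; all names here are illustrative, and the limit and window should be tuned to your provider's published rate limits:

```typescript
// A minimal client-side throttle: allows at most `limit` requests per
// rolling window of `windowMs` milliseconds.
class RequestThrottle {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if a request may proceed now, false if it should wait.
  tryAcquire(now: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}
```

Call `tryAcquire` before each API request and sleep briefly when it returns `false`.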
When retries alone are not enough, implement request throttling in your application to stay within the provider’s rate limits.

## Batch Requests to Reduce Volume
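A hedged sketch of folding several related questions into one request; the helper name and prompt wording are illustrative, not part of Polyglot:

```typescript
// Combine related questions into one numbered prompt so a single API
// request replaces N separate ones.
function batchQuestions(questions: string[]): string {
  const numbered = questions.map((q, i) => `${i + 1}. ${q}`).join("\n");
  return `Answer each question separately, keeping the numbering:\n${numbered}`;
}
```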
Instead of making many small requests, combine related questions into a single prompt when the use case allows.

## Additional Strategies
- Switch providers or models. Different providers and models have different rate limits. If one provider is heavily throttled, route some requests to another.
- Upgrade your API plan. Most providers offer higher rate limits on paid tiers.
- Cache responses. If the same prompts recur frequently, cache the results to avoid redundant API calls.
- Use off-peak hours. Some providers see less contention outside peak times, reducing the likelihood of rate limiting.
- Monitor usage. Track your request volume and token consumption to anticipate rate limit issues before they affect users.
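The caching strategy above can be sketched as a simple prompt-keyed memoizer. All names are illustrative; `callModel` stands in for a real Polyglot call:

```typescript
// Cache responses keyed by prompt so repeated prompts skip the API entirely.
function cached(callModel: (prompt: string) => string) {
  const cache = new Map<string, string>();
  return (prompt: string): string => {
    const hit = cache.get(prompt);
    if (hit !== undefined) return hit; // cache hit: no API call
    const result = callModel(prompt);
    cache.set(prompt, result);
    return result;
  };
}

// Example: count how many times the underlying call actually runs.
let calls = 0;
const ask = cached((p) => { calls += 1; return `answer to: ${p}`; });
ask("What is rate limiting?");
ask("What is rate limiting?"); // served from cache; `calls` stays at 1
```

In production you would also bound the cache size and expire stale entries.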