Documentation Index
Fetch the complete documentation index at: https://docs.instructorphp.com/llms.txt
Use this file to discover all available pages before exploring further.
Embeddings requests are typically fast and inexpensive, but at scale the details matter. This page covers the key patterns for keeping your embeddings pipeline efficient and reliable.
The single most impactful optimization is batching. Instead of making one request per document, send multiple texts in a single call. This reduces HTTP overhead and is often cheaper per token:
<?php
use Cognesy\Polyglot\Embeddings\Embeddings;
$response = Embeddings::using('openai')
->withInputs([
'Document one',
'Document two',
'Document three',
])
->get();
$vectors = $response->toValuesArray();
// @doctest id="6756"
Each provider has a maximum number of inputs per request (configured as maxInputs in the preset). For OpenAI this defaults to 2048; for Cohere it is 96. When processing large datasets, chunk your documents to stay within these limits.
Processing Large Datasets
When you have more documents than a single batch can handle, process them in chunks:
<?php
use Cognesy\Polyglot\Embeddings\Embeddings;
$embeddings = Embeddings::using('openai');
$allDocuments = [/* hundreds or thousands of documents */];
$batchSize = 25; // Stay well within provider limits
$vectors = [];
for ($i = 0; $i < count($allDocuments); $i += $batchSize) {
$batch = array_slice($allDocuments, $i, $batchSize);
try {
$response = $embeddings->withInputs($batch)->get();
$vectors = array_merge($vectors, $response->toValuesArray());
$batchNum = (int) floor($i / $batchSize) + 1;
$totalBatches = (int) ceil(count($allDocuments) / $batchSize);
echo "Processed batch {$batchNum} of {$totalBatches}\n";
} catch (\Exception $e) {
echo "Error processing batch: " . $e->getMessage() . "\n";
}
// Small delay to avoid hitting rate limits
usleep(100_000); // 100ms
}
echo "Processed " . count($vectors) . " embeddings in total.\n";
// @doctest id="c225"
Retry Policies
Network failures and rate limits are inevitable in production. Polyglot provides an EmbeddingsRetryPolicy that implements exponential backoff with configurable jitter:
<?php
use Cognesy\Polyglot\Embeddings\Config\EmbeddingsRetryPolicy;
use Cognesy\Polyglot\Embeddings\Embeddings;
$response = Embeddings::using('openai')
->withInputs(['Document one'])
->withRetryPolicy(new EmbeddingsRetryPolicy(
maxAttempts: 3,
baseDelayMs: 250,
maxDelayMs: 8000,
jitter: 'full',
retryOnStatus: [408, 429, 500, 502, 503, 504],
))
->get();
// @doctest id="440e"
Retry Policy Parameters
| Parameter | Default | Description |
|---|
maxAttempts | 1 | Total number of attempts (1 = no retries) |
baseDelayMs | 250 | Base delay in milliseconds before the first retry |
maxDelayMs | 8000 | Maximum delay cap in milliseconds |
jitter | 'full' | Jitter strategy: 'none', 'full', or 'equal' |
retryOnStatus | [408, 429, 500, 502, 503, 504] | HTTP status codes that trigger a retry |
retryOnExceptions | [TimeoutException, NetworkException] | Exception classes that trigger a retry |
The delay for each attempt is calculated as baseDelayMs * 2^(attempt-1), capped at maxDelayMs, then jitter is applied:
none — Exact calculated delay, no randomization.
full — Random value between 0 and the calculated delay. Best for reducing thundering herd.
equal — Half the calculated delay plus a random value up to half. A middle ground.
Important: Set maxAttempts to at least 3 in production to handle transient failures gracefully. The default of 1 means no retries.
Caching Embeddings
Embedding the same text repeatedly is wasteful. For applications that frequently re-embed identical strings (such as search queries or template documents), a caching layer pays for itself quickly:
<?php
use Cognesy\Polyglot\Embeddings\Embeddings;
class CachedEmbeddings
{
private Embeddings $embeddings;
/** @var array<string, float[]> */
private array $cache = [];
public function __construct(?Embeddings $embeddings = null)
{
$this->embeddings = $embeddings ?? Embeddings::using('openai');
}
/**
* Get the embedding for a single text, using cache when available.
*
* @return float[]
*/
public function embed(string $text, array $options = []): array
{
$key = $this->cacheKey($text, $options);
if (isset($this->cache[$key])) {
return $this->cache[$key];
}
$vector = $this->embeddings
->withInputs($text)
->withOptions($options)
->first()
->values();
$this->cache[$key] = $vector;
return $vector;
}
/**
* Embed multiple texts, fetching only uncached ones from the API.
*
* @param string[] $texts
* @return float[][]
*/
public function embedMany(array $texts, array $options = []): array
{
$results = [];
$uncachedTexts = [];
$uncachedIndices = [];
foreach ($texts as $i => $text) {
$key = $this->cacheKey($text, $options);
if (isset($this->cache[$key])) {
$results[$i] = $this->cache[$key];
} else {
$uncachedTexts[] = $text;
$uncachedIndices[] = $i;
}
}
if ($uncachedTexts !== []) {
$response = $this->embeddings
->withInputs($uncachedTexts)
->withOptions($options)
->get();
foreach ($response->toValuesArray() as $j => $vector) {
$i = $uncachedIndices[$j];
$results[$i] = $vector;
$this->cache[$this->cacheKey($texts[$i], $options)] = $vector;
}
}
ksort($results);
return $results;
}
private function cacheKey(string $text, array $options): string
{
return md5($text . serialize($options));
}
}
// @doctest id="97ea"
Usage:
<?php
$cached = new CachedEmbeddings(Embeddings::using('openai'));
// First call hits the API
$vector = $cached->embed('What is machine learning?');
// Second call returns from cache instantly
$vector = $cached->embed('What is machine learning?');
// Batch with partial cache hits
$vectors = $cached->embedMany([
'What is machine learning?', // cached
'How do neural networks work?', // API call
]);
// @doctest id="36cb"
Tip: For persistent caching across requests, replace the in-memory array with Redis, Memcached, or a database-backed store.
Choosing the Right Model
Model selection has a direct impact on both cost and quality. Here are the key trade-offs:
| Factor | Smaller Models | Larger Models |
|---|
| Dimensions | Fewer (e.g., 256-1536) | More (e.g., 3072) |
| Speed | Faster response times | Slower response times |
| Cost | Lower per-token cost | Higher per-token cost |
| Quality | Good for general use | Better for nuanced similarity |
| Storage | Less memory per vector | More memory per vector |
Some providers (like OpenAI’s text-embedding-3 models) support requesting a specific number of dimensions, letting you trade precision for storage efficiency:
<?php
use Cognesy\Polyglot\Embeddings\Embeddings;
// Full-dimension embedding (3072 dimensions)
$full = Embeddings::using('openai')
->withModel('text-embedding-3-large')
->withInputs('Sample text')
->first();
// Reduced-dimension embedding (256 dimensions, less storage)
$compact = Embeddings::using('openai')
->withModel('text-embedding-3-large')
->withInputs('Sample text')
->withOptions(['dimensions' => 256])
->first();
echo "Full: " . count($full->values()) . " dimensions\n";
echo "Compact: " . count($compact->values()) . " dimensions\n";
// @doctest id="2fab"
Best Practices
Batch whenever possible. A single request with 100 texts is faster and cheaper than 100 individual requests.
Set retry policies in production. Rate limits (HTTP 429) and transient server errors are common. Configure at least 3 attempts with jitter to handle them gracefully.
Cache aggressively. Embeddings for the same text and model are deterministic. Cache them to avoid redundant API calls and reduce latency.
Monitor token usage. Use the usage() method on responses to track consumption and detect unexpected spikes:
<?php
use Cognesy\Polyglot\Embeddings\Embeddings;
$response = Embeddings::using('openai')
->withInputs($documents)
->get();
$usage = $response->usage();
echo "Tokens used: " . $usage->total() . "\n";
// @doctest id="2504"
Match dimensions to your storage. If you are storing millions of vectors, reducing dimensions from 3072 to 256 can cut storage costs by over 90% with only modest quality loss.