JSON Extraction Strategies
InstructorPHP uses multiple strategies to extract JSON from LLM responses, handling edge cases where the LLM returns JSON that is wrapped in markdown, surrounded by text, or malformed.
Extraction Pipeline
When processing an LLM response, InstructorPHP tries multiple extraction strategies in order:
1. Direct Parsing (Try As-Is)
Attempts to parse the response directly as JSON.
2. Markdown Code Block Extraction
Extracts JSON from markdown fenced code blocks, both those tagged json and untagged ones.
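A minimal sketch of fenced-block extraction follows. The helper name `extractFencedJson` and the regular expression are assumptions made for illustration, not InstructorPHP's actual implementation; only `preg_match` and `json_decode` are real PHP functions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the library's API): returns the body of the
 * first fenced code block, or null when the text contains no fence.
 */
function extractFencedJson(string $text): ?string
{
    // Match a json-tagged or untagged fence; the /s flag lets "." span newlines.
    if (preg_match('/```(?:json)?\s*(.*?)\s*```/s', $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

$response = "Here is the result:\n```json\n{\"name\": \"Ada\", \"age\": 36}\n```\nLet me know if you need more.";

$json = extractFencedJson($response);   // the JSON text without the fence
$data = json_decode($json ?? '', true); // decoded as an associative array
```

The lazy `(.*?)` group stops at the first closing fence, so text after the block is ignored.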
4. Smart Brace Matching
Handles nested braces and escaped quotes.
Parsing Strategies
After extraction, multiple parsers attempt to handle malformed JSON:
1. Standard JSON Parser
Native json_decode with strict error handling.
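As a rough illustration of strict decoding, the sketch below uses PHP's `JSON_THROW_ON_ERROR` flag (available since PHP 7.3). The wrapper name `parseStrict` and its null-on-failure contract are assumptions for this example, not the library's API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper: decodes strictly and returns null on invalid input.
 */
function parseStrict(string $json): ?array
{
    try {
        $decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        return null; // syntax error, depth exceeded, invalid UTF-8, ...
    }
    // Bare scalars such as "42" are valid JSON but not a structured payload.
    return is_array($decoded) ? $decoded : null;
}

$ok  = parseStrict('{"a": 1}');  // decodes to an array
$bad = parseStrict('{"a": 1,}'); // trailing comma is rejected
```

Throwing on error avoids the ambiguity of plain `json_decode`, which returns null both for the literal `null` and for invalid input.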
2. Resilient Parser
Applies automatic repairs before parsing:
- Balance quotes - Adds missing closing quotes
- Remove trailing commas - Fixes {"a": 1,}
- Balance braces - Adds missing } or ]
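The three repairs listed above can be sketched in a single pass over the input. Both function names here are hypothetical and the logic is a simplified illustration under the assumption that repairs happen character by character; the library's resilient parser may work differently.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: removes one trailing comma (and whitespace after it). */
function dropTrailingComma(string $s): string
{
    $trimmed = rtrim($s);
    return substr($trimmed, -1) === ',' ? substr($trimmed, 0, -1) : $s;
}

/** Hypothetical repair pass: balances quotes, drops trailing commas, adds closers. */
function repairJson(string $json): string
{
    $out = '';
    $stack = [];       // closers still owed for every open { or [
    $inString = false;
    $escaped = false;
    $length = strlen($json);

    for ($i = 0; $i < $length; $i++) {
        $ch = $json[$i];
        if ($inString) {
            $out .= $ch;
            if ($escaped) { $escaped = false; }
            elseif ($ch === '\\') { $escaped = true; }
            elseif ($ch === '"') { $inString = false; }
            continue;
        }
        if ($ch === '"') { $inString = true; }
        elseif ($ch === '{') { $stack[] = '}'; }
        elseif ($ch === '[') { $stack[] = ']'; }
        elseif ($ch === '}' || $ch === ']') {
            array_pop($stack);
            $out = dropTrailingComma($out); // fixes {"a": 1,}
        }
        $out .= $ch;
    }

    if ($inString) {
        $out .= '"';                        // balance quotes
    }
    $out = dropTrailingComma($out);         // comma left at end of input
    return $out . implode('', array_reverse($stack)); // balance braces
}

$fixedComma  = repairJson('{"a": 1,}');
$fixedBraces = repairJson('{"items": [1, 2');
$fixedQuote  = repairJson('{"name": "Ada');
```

Tracking string state matters: a comma or brace inside a quoted value must not be treated as structure.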
3. Partial JSON Parser
Handles incomplete JSON during streaming.
Implementation Details
Location: packages/utils/src/Json/JsonParser.php
Why This Matters
LLMs don’t always return clean JSON:
- Claude sometimes wraps it in markdown
- GPT-4 may add explanations
- Gemini might include partial responses during streaming
- Custom prompts can lead to unexpected formats
Common Scenarios
Scenario 1: LLM Adds Explanation
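A sketch of this scenario, assuming the JSON object is embedded in surrounding prose: the hypothetical helper `firstJsonObject` scans for the first balanced `{...}` span while ignoring braces inside strings. It returns an empty string when no complete object is found. It is an illustration of brace matching, not the library's code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the first balanced {...} span in $text,
 * or '' when no complete object exists.
 */
function firstJsonObject(string $text): string
{
    $start = strpos($text, '{');
    if ($start === false) {
        return '';
    }
    $depth = 0;
    $inString = false;
    $escaped = false;
    $length = strlen($text);
    for ($i = $start; $i < $length; $i++) {
        $ch = $text[$i];
        if ($inString) {
            // Inside a string only an unescaped quote changes state.
            if ($escaped) { $escaped = false; }
            elseif ($ch === '\\') { $escaped = true; }
            elseif ($ch === '"') { $inString = false; }
            continue;
        }
        if ($ch === '"') { $inString = true; }
        elseif ($ch === '{') { $depth++; }
        elseif ($ch === '}' && --$depth === 0) {
            return substr($text, $start, $i - $start + 1);
        }
    }
    return ''; // braces never balanced
}

$response = 'Sure! Here is the user: {"name": "Ada", "note": "uses \"{\" often", "tags": {"role": "admin"}} Hope that helps.';
$data = json_decode(firstJsonObject($response), true);
```

The escaped quotes and the brace inside the `note` value do not confuse the scan, because depth is only counted outside strings.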
Scenario 2: Markdown Wrapped Response
Scenario 4: Streaming Partial Response
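One way to picture this scenario: chunks accumulate in a buffer, and each snapshot is decoded after closing whatever is still open. The helper `closePartialJson` and the sample chunks are hypothetical, chosen only to illustrate the idea; the library's partial parser is not shown here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: closes an open string and any open arrays or
 * objects so that a JSON prefix can be decoded.
 */
function closePartialJson(string $partial): string
{
    $closers = [];
    $inString = false;
    $escaped = false;
    foreach (str_split($partial) as $ch) {
        if ($inString) {
            if ($escaped) { $escaped = false; }
            elseif ($ch === '\\') { $escaped = true; }
            elseif ($ch === '"') { $inString = false; }
        } elseif ($ch === '"') { $inString = true; }
        elseif ($ch === '{') { $closers[] = '}'; }
        elseif ($ch === '[') { $closers[] = ']'; }
        elseif ($ch === '}' || $ch === ']') { array_pop($closers); }
    }
    return $partial . ($inString ? '"' : '') . implode('', array_reverse($closers));
}

// Simulated stream: the response arrives in three pieces.
$chunks = ['{"name": "Ad', 'a", "tags": ["ma', 'th", "logic"]}'];
$buffer = '';
$snapshots = [];
foreach ($chunks as $chunk) {
    $buffer .= $chunk;
    $snapshots[] = json_decode(closePartialJson($buffer), true);
}
```

This sketch does not cover every prefix: a buffer that ends right after a key and colon, such as `{"name":`, still fails to decode, so a real partial parser needs additional rules for dangling keys and commas.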
Error Handling
If all strategies fail, InstructorPHP:
- Returns an empty string from findCompleteJson()
- Triggers a validation error
- Initiates retry mechanism (if configured)
- Provides error feedback to LLM for self-correction
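The retry-with-feedback flow described above might look roughly like the following. Everything here is an assumption for illustration: `extractWithRetries`, the `$askModel` callable standing in for a real LLM call, and the feedback wording are not part of InstructorPHP's API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical retry loop: asks the model, and on invalid JSON appends
 * the decode error to the prompt before trying again.
 */
function extractWithRetries(callable $askModel, string $prompt, int $maxRetries = 2): ?array
{
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        $data = json_decode($askModel($prompt), true);
        if (is_array($data)) {
            return $data;
        }
        // Feed the failure back so the model can correct itself.
        $prompt .= "\nYour previous reply was not valid JSON ("
            . json_last_error_msg() . "). Reply with JSON only.";
    }
    return null; // retries exhausted
}

// Stub model for demonstration: fails once, then returns valid JSON.
$calls = 0;
$stub = function (string $prompt) use (&$calls): string {
    $calls++;
    return $calls === 1 ? 'Sorry, here you go' : '{"status": "ok"}';
};

$result = extractWithRetries($stub, 'Return the status as JSON.');
```

Capping the number of attempts keeps a persistently malformed response from looping indefinitely.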
Performance Considerations
Extraction overhead:
- Direct parsing: ~0.1ms
- Markdown extraction: ~0.5ms (regex)
- Bracket matching: ~0.2ms (string ops)
- Smart brace matching: ~1-2ms (character iteration)