Conversation
Greptile SummaryThis PR adds surrogate-character sanitization across all message formatting paths to prevent lone Unicode surrogates (U+D800–U+DFFF) from reaching LLM providers as invalid UTF-8 in wire-format JSON. A
Confidence Score: 3/5The change is directionally correct and covers most formatting paths, but the image-extraction header code calls JSON.stringify on an unsanitized object, which can throw on modern runtimes — making the fix actively unsafe in that code path. The sanitizeSurrogates(JSON.stringify(stripped)) ordering in the image-extraction branch will crash in Node.js 21+ when a tool output mixes image blocks with lone-surrogate strings; this turns a would-be fix into a new failure mode for that path. The rest of the sanitization logic is solid and the regex is correct. packages/shared/src/llms/ai-sdk-format.ts — specifically the stripImagesFromOutput path in toAiSdkToolResultOutput and the array branch of the system-prompt push in formatMessagesForAiSdk
|
| Filename | Overview |
|---|---|
| packages/shared/src/llms/ai-sdk-format.ts | Adds sanitizeSurrogates and sanitizeDeepStrings helpers and applies them across all message/tool-result formatting paths; one order-of-operations bug exists in the image-extraction header path where JSON.stringify runs before sanitization |
| packages/llms/src/providers/ai-sdk.ts | Applies sanitizeSurrogates defensively to text and reasoning parts in toAiSdkMessages; straightforward and correct |
| packages/shared/src/llms/ai-sdk-format.test.ts | Adds 11 surrogate-sanitization tests covering most paths; the image-stripping + surrogate edge case is not tested |
| packages/shared/src/index.ts | Exports sanitizeSurrogates from the shared package; no issues |
| packages/shared/src/index.browser.ts | Exports sanitizeSurrogates for the browser bundle; no issues |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Input: message / system / tool output] --> B{Type?}
B -->|string systemContent| C[sanitizeSurrogates]
B -->|array systemContent| D[NO sanitization]
B -->|text part| E[sanitizeSurrogates]
B -->|reasoning part| F[sanitizeSurrogates]
B -->|file content| G[sanitizeSurrogates]
B -->|tool-result output| H{Output shape?}
H -->|string| I[sanitizeSurrogates]
H -->|AiSdkContentBlock array| J[sanitizeSurrogates per text block]
H -->|object with images| K{stripped result}
K -->|string| L[sanitizeSurrogates]
K -->|object| M[JSON.stringify THEN sanitizeSurrogates - may throw in Node 21+]
H -->|plain object/null/bool/num| N[sanitizeDeepStrings recursive]
H -->|other| O[sanitizeSurrogates on String cast]
C --> P[AiSdkMessage output]
D --> P
E --> P
F --> P
G --> P
I --> P
J --> P
L --> P
M --> P
N --> P
O --> P
Comments Outside Diff (1)
-
packages/shared/src/llms/ai-sdk-format.ts, line 275-286 (link)When
systemContentis an array ofAiSdkMessagePartobjects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carrytextparts with arbitrary string content — is left unguarded.Prompt To Fix With AI
This is a comment left during a code review. Path: packages/shared/src/llms/ai-sdk-format.ts Line: 275-286 Comment: When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/shared/src/llms/ai-sdk-format.ts:230-234
The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.
```suggestion
if (images.length > 0) {
const headerText =
typeof stripped === "string"
? sanitizeSurrogates(stripped)
: JSON.stringify(sanitizeDeepStrings(stripped));
```
### Issue 2 of 2
packages/shared/src/llms/ai-sdk-format.ts:275-286
When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded.
```suggestion
if (
(typeof systemContent === "string" && systemContent.trim().length > 0) ||
(Array.isArray(systemContent) && systemContent.length > 0)
) {
result.push({
role: "system",
content:
typeof systemContent === "string"
? sanitizeSurrogates(systemContent)
: (sanitizeDeepStrings(systemContent) as AiSdkMessagePart[]),
});
}
```
Reviews (1): Last reviewed commit: "sanitize unicode surrogates" | Re-trigger Greptile
| if (images.length > 0) { | ||
| const headerText = | ||
| typeof stripped === "string" ? stripped : JSON.stringify(stripped); | ||
| typeof stripped === "string" | ||
| ? sanitizeSurrogates(stripped) | ||
| : sanitizeSurrogates(JSON.stringify(stripped)); |
There was a problem hiding this comment.
The sanitization happens in the wrong order here.
JSON.stringify(stripped) is called before sanitizeSurrogates, so if stripped is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a TypeError during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.
| if (images.length > 0) { | |
| const headerText = | |
| typeof stripped === "string" ? stripped : JSON.stringify(stripped); | |
| typeof stripped === "string" | |
| ? sanitizeSurrogates(stripped) | |
| : sanitizeSurrogates(JSON.stringify(stripped)); | |
| if (images.length > 0) { | |
| const headerText = | |
| typeof stripped === "string" | |
| ? sanitizeSurrogates(stripped) | |
| : JSON.stringify(sanitizeDeepStrings(stripped)); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/shared/src/llms/ai-sdk-format.ts
Line: 230-234
Comment:
The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.
```suggestion
if (images.length > 0) {
const headerText =
typeof stripped === "string"
? sanitizeSurrogates(stripped)
: JSON.stringify(sanitizeDeepStrings(stripped));
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
The fix looks correct now. The updated line:
: JSON.stringify(sanitizeDeepStrings(stripped));calls sanitizeDeepStrings before JSON.stringify, so lone surrogates are replaced before V8 ever attempts serialization. My original concern no longer applies — this is the right order of operations.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
Summary
Lone/unpaired Unicode surrogate characters (U+D800–U+DFFF) can reach LLM providers through tool output, user input, file content, and system prompts. When serialized to wire-format JSON these produce invalid UTF-8 sequences, causing provider-side request rejections, JSON serialization failures, and potential denial-of-service in message processing pipelines.
This applies the same surrogate sanitization strategy used in anomalyco/opencode#25934, replacing lone surrogates with the Unicode replacement character (U+FFFD) at the provider message boundary.
What changed
packages/shared/src/llms/ai-sdk-format.tssanitizeSurrogates(content: string): string— replaces unpaired surrogates with U+FFFD using the same regex as the OpenCode fixsanitizeDeepStrings(value: unknown): unknown— recursively walks nested objects/arrays to sanitize all string values in structured tool outputtoAiSdkToolResultOutput: sanitizedtext/error-textstring values,contentblock text parts,jsonoutput (deep),image-extractionheader text, and the fallbackString()pathformatMessagesForAiSdk: sanitized system prompt strings, text parts, reasoning parts, and file content textpackages/llms/src/providers/ai-sdk.tstoAiSdkMessages: sanitized text and reasoning parts as defense-in-depth for the primary provider pathpackages/shared/src/llms/ai-sdk-format.test.tsCoverage
text/error-textoutputcontentblock textjsonoutput (deeply nested)String()Verification
No regressions in existing tests.