Skip to content

Sanitize Unicode Surrogates#34

Open
jsimone wants to merge 2 commits intomainfrom
sanitize-unicode-surrogates
Open

Sanitize Unicode Surrogates#34
jsimone wants to merge 2 commits intomainfrom
sanitize-unicode-surrogates

Conversation

@jsimone
Copy link
Copy Markdown
Contributor

@jsimone jsimone commented May 7, 2026

Summary

Lone/unpaired Unicode surrogate characters (U+D800–U+DFFF) can reach LLM providers through tool output, user input, file content, and system prompts. When serialized to wire-format JSON these produce invalid UTF-8 sequences, causing provider-side request rejections, JSON serialization failures, and potential denial-of-service in message processing pipelines.

This applies the same surrogate sanitization strategy used in anomalyco/opencode#25934, replacing lone surrogates with the Unicode replacement character (U+FFFD) at the provider message boundary.

What changed

packages/shared/src/llms/ai-sdk-format.ts

  • Added sanitizeSurrogates(content: string): string — replaces unpaired surrogates with U+FFFD using the same regex as the OpenCode fix
  • Added sanitizeDeepStrings(value: unknown): unknown — recursively walks nested objects/arrays to sanitize all string values in structured tool output
  • toAiSdkToolResultOutput: sanitized text/error-text string values, content block text parts, json output (deep), image-extraction header text, and the fallback String() path
  • formatMessagesForAiSdk: sanitized system prompt strings, text parts, reasoning parts, and file content text

packages/llms/src/providers/ai-sdk.ts

  • toAiSdkMessages: sanitized text and reasoning parts as defense-in-depth for the primary provider path

packages/shared/src/llms/ai-sdk-format.test.ts

  • Added 11 tests covering: lone surrogate replacement, valid pair preservation, mixed surrogate text, system/user/assistant/tool message sanitization, content-block text, deeply nested JSON strings, and file content

Coverage

Text path Sanitized?
System prompt (string)
User message text parts
Assistant message text parts
Assistant reasoning parts
File content blocks
Tool-result text / error-text output
Tool-result content block text
Tool-result json output (deeply nested)
Tool-result fallback String()
Image-extraction header text

Verification

bun run --filter @clinebot/shared test -- --run   # 134 tests pass (12 files)

No regressions in existing tests.

@jsimone jsimone requested a review from abeatrix May 7, 2026 00:35
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR adds surrogate-character sanitization across all message formatting paths to prevent lone Unicode surrogates (U+D800–U+DFFF) from reaching LLM providers as invalid UTF-8 in wire-format JSON. A sanitizeSurrogates helper and a recursive sanitizeDeepStrings walker are introduced and applied to system prompts, user/assistant text and reasoning parts, file content, and tool result outputs.

  • Image-extraction header path bug: sanitizeSurrogates(JSON.stringify(stripped)) calls JSON.stringify on a potentially surrogate-containing object before sanitization, which will throw a TypeError in Node.js 21+ / V8 12+; JSON.stringify(sanitizeDeepStrings(stripped)) is the correct order.
  • Array system content gap: When systemContent is passed as an AiSdkMessagePart[], the string values inside those parts bypass all sanitization; only the string-typed system prompt path is covered.
  • Test gap: No test covers the mixed-image-plus-lone-surrogate object case that triggers the image-extraction path bug.

Confidence Score: 3/5

The change is directionally correct and covers most formatting paths, but the image-extraction header code calls JSON.stringify on an unsanitized object, which can throw on modern runtimes — making the fix actively unsafe in that code path.

The sanitizeSurrogates(JSON.stringify(stripped)) ordering in the image-extraction branch will crash in Node.js 21+ when a tool output mixes image blocks with lone-surrogate strings; this turns a would-be fix into a new failure mode for that path. The rest of the sanitization logic is solid and the regex is correct.

packages/shared/src/llms/ai-sdk-format.ts — specifically the stripImagesFromOutput path in toAiSdkToolResultOutput and the array branch of the system-prompt push in formatMessagesForAiSdk

Important Files Changed

Filename Overview
packages/shared/src/llms/ai-sdk-format.ts Adds sanitizeSurrogates and sanitizeDeepStrings helpers and applies them across all message/tool-result formatting paths; one order-of-operations bug exists in the image-extraction header path where JSON.stringify runs before sanitization
packages/llms/src/providers/ai-sdk.ts Applies sanitizeSurrogates defensively to text and reasoning parts in toAiSdkMessages; straightforward and correct
packages/shared/src/llms/ai-sdk-format.test.ts Adds 11 surrogate-sanitization tests covering most paths; the image-stripping + surrogate edge case is not tested
packages/shared/src/index.ts Exports sanitizeSurrogates from the shared package; no issues
packages/shared/src/index.browser.ts Exports sanitizeSurrogates for the browser bundle; no issues

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input: message / system / tool output] --> B{Type?}
    B -->|string systemContent| C[sanitizeSurrogates]
    B -->|array systemContent| D[NO sanitization]
    B -->|text part| E[sanitizeSurrogates]
    B -->|reasoning part| F[sanitizeSurrogates]
    B -->|file content| G[sanitizeSurrogates]
    B -->|tool-result output| H{Output shape?}
    H -->|string| I[sanitizeSurrogates]
    H -->|AiSdkContentBlock array| J[sanitizeSurrogates per text block]
    H -->|object with images| K{stripped result}
    K -->|string| L[sanitizeSurrogates]
    K -->|object| M[JSON.stringify THEN sanitizeSurrogates - may throw in Node 21+]
    H -->|plain object/null/bool/num| N[sanitizeDeepStrings recursive]
    H -->|other| O[sanitizeSurrogates on String cast]
    C --> P[AiSdkMessage output]
    D --> P
    E --> P
    F --> P
    G --> P
    I --> P
    J --> P
    L --> P
    M --> P
    N --> P
    O --> P
Loading

Comments Outside Diff (1)

  1. packages/shared/src/llms/ai-sdk-format.ts, line 275-286 (link)

    P2 When systemContent is an array of AiSdkMessagePart objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry text parts with arbitrary string content — is left unguarded.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/shared/src/llms/ai-sdk-format.ts
    Line: 275-286
    
    Comment:
    When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/shared/src/llms/ai-sdk-format.ts:230-234
The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.

```suggestion
		if (images.length > 0) {
			const headerText =
				typeof stripped === "string"
					? sanitizeSurrogates(stripped)
					: JSON.stringify(sanitizeDeepStrings(stripped));
```

### Issue 2 of 2
packages/shared/src/llms/ai-sdk-format.ts:275-286
When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded.

```suggestion
	if (
		(typeof systemContent === "string" && systemContent.trim().length > 0) ||
		(Array.isArray(systemContent) && systemContent.length > 0)
	) {
		result.push({
			role: "system",
			content:
				typeof systemContent === "string"
					? sanitizeSurrogates(systemContent)
					: (sanitizeDeepStrings(systemContent) as AiSdkMessagePart[]),
		});
	}
```

Reviews (1): Last reviewed commit: "sanitize unicode surrogates" | Re-trigger Greptile

Comment on lines +230 to +234
if (images.length > 0) {
const headerText =
typeof stripped === "string" ? stripped : JSON.stringify(stripped);
typeof stripped === "string"
? sanitizeSurrogates(stripped)
: sanitizeSurrogates(JSON.stringify(stripped));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The sanitization happens in the wrong order here. JSON.stringify(stripped) is called before sanitizeSurrogates, so if stripped is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a TypeError during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.

Suggested change
if (images.length > 0) {
const headerText =
typeof stripped === "string" ? stripped : JSON.stringify(stripped);
typeof stripped === "string"
? sanitizeSurrogates(stripped)
: sanitizeSurrogates(JSON.stringify(stripped));
if (images.length > 0) {
const headerText =
typeof stripped === "string"
? sanitizeSurrogates(stripped)
: JSON.stringify(sanitizeDeepStrings(stripped));
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/shared/src/llms/ai-sdk-format.ts
Line: 230-234

Comment:
The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.

```suggestion
		if (images.length > 0) {
			const headerText =
				typeof stripped === "string"
					? sanitizeSurrogates(stripped)
					: JSON.stringify(sanitizeDeepStrings(stripped));
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai please review again

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks correct now. The updated line:

: JSON.stringify(sanitizeDeepStrings(stripped));

calls sanitizeDeepStrings before JSON.stringify, so lone surrogates are replaced before V8 ever attempts serialization. My original concern no longer applies — this is the right order of operations.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants