Sanitize Unicode Surrogates by jsimone · Pull Request #34 · cline/sdk

jsimone · 2026-05-07T00:35:54Z

Summary

Lone/unpaired Unicode surrogate characters (U+D800–U+DFFF) can reach LLM providers through tool output, user input, file content, and system prompts. When serialized to wire-format JSON these produce invalid UTF-8 sequences, causing provider-side request rejections, JSON serialization failures, and potential denial-of-service in message processing pipelines.

This applies the same surrogate sanitization strategy used in anomalyco/opencode#25934, replacing lone surrogates with the Unicode replacement character (U+FFFD) at the provider message boundary.

What changed

packages/shared/src/llms/ai-sdk-format.ts

Added sanitizeSurrogates(content: string): string — replaces unpaired surrogates with U+FFFD using the same regex as the OpenCode fix
Added sanitizeDeepStrings(value: unknown): unknown — recursively walks nested objects/arrays to sanitize all string values in structured tool output
toAiSdkToolResultOutput: sanitized text/error-text string values, content block text parts, json output (deep), image-extraction header text, and the fallback String() path
formatMessagesForAiSdk: sanitized system prompt strings, text parts, reasoning parts, and file content text

packages/llms/src/providers/ai-sdk.ts

toAiSdkMessages: sanitized text and reasoning parts as defense-in-depth for the primary provider path

packages/shared/src/llms/ai-sdk-format.test.ts

Added 11 tests covering: lone surrogate replacement, valid pair preservation, mixed surrogate text, system/user/assistant/tool message sanitization, content-block text, deeply nested JSON strings, and file content

Coverage

Text path	Sanitized?
System prompt (string)	✅
User message text parts	✅
Assistant message text parts	✅
Assistant reasoning parts	✅
File content blocks	✅
Tool-result `text` / `error-text` output	✅
Tool-result `content` block text	✅
Tool-result `json` output (deeply nested)	✅
Tool-result fallback `String()`	✅
Image-extraction header text	✅

Verification

bun run --filter @clinebot/shared test -- --run   # 134 tests pass (12 files)

No regressions in existing tests.

greptile-apps · 2026-05-07T00:40:01Z

Greptile Summary

This PR adds surrogate-character sanitization across all message formatting paths to prevent lone Unicode surrogates (U+D800–U+DFFF) from reaching LLM providers as invalid UTF-8 in wire-format JSON. A sanitizeSurrogates helper and a recursive sanitizeDeepStrings walker are introduced and applied to system prompts, user/assistant text and reasoning parts, file content, and tool result outputs.

Image-extraction header path bug: sanitizeSurrogates(JSON.stringify(stripped)) calls JSON.stringify on a potentially surrogate-containing object before sanitization, which will throw a TypeError in Node.js 21+ / V8 12+; JSON.stringify(sanitizeDeepStrings(stripped)) is the correct order.
Array system content gap: When systemContent is passed as an AiSdkMessagePart[], the string values inside those parts bypass all sanitization; only the string-typed system prompt path is covered.
Test gap: No test covers the mixed-image-plus-lone-surrogate object case that triggers the image-extraction path bug.

Confidence Score: 3/5

The change is directionally correct and covers most formatting paths, but the image-extraction header code calls JSON.stringify on an unsanitized object, which can throw on modern runtimes — making the fix actively unsafe in that code path.

The sanitizeSurrogates(JSON.stringify(stripped)) ordering in the image-extraction branch will crash in Node.js 21+ when a tool output mixes image blocks with lone-surrogate strings; this turns a would-be fix into a new failure mode for that path. The rest of the sanitization logic is solid and the regex is correct.

packages/shared/src/llms/ai-sdk-format.ts — specifically the stripImagesFromOutput path in toAiSdkToolResultOutput and the array branch of the system-prompt push in formatMessagesForAiSdk

Important Files Changed

Filename	Overview
packages/shared/src/llms/ai-sdk-format.ts	Adds sanitizeSurrogates and sanitizeDeepStrings helpers and applies them across all message/tool-result formatting paths; one order-of-operations bug exists in the image-extraction header path where JSON.stringify runs before sanitization
packages/llms/src/providers/ai-sdk.ts	Applies sanitizeSurrogates defensively to text and reasoning parts in toAiSdkMessages; straightforward and correct
packages/shared/src/llms/ai-sdk-format.test.ts	Adds 11 surrogate-sanitization tests covering most paths; the image-stripping + surrogate edge case is not tested
packages/shared/src/index.ts	Exports sanitizeSurrogates from the shared package; no issues
packages/shared/src/index.browser.ts	Exports sanitizeSurrogates for the browser bundle; no issues

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input: message / system / tool output] --> B{Type?}
    B -->|string systemContent| C[sanitizeSurrogates]
    B -->|array systemContent| D[NO sanitization]
    B -->|text part| E[sanitizeSurrogates]
    B -->|reasoning part| F[sanitizeSurrogates]
    B -->|file content| G[sanitizeSurrogates]
    B -->|tool-result output| H{Output shape?}
    H -->|string| I[sanitizeSurrogates]
    H -->|AiSdkContentBlock array| J[sanitizeSurrogates per text block]
    H -->|object with images| K{stripped result}
    K -->|string| L[sanitizeSurrogates]
    K -->|object| M[JSON.stringify THEN sanitizeSurrogates - may throw in Node 21+]
    H -->|plain object/null/bool/num| N[sanitizeDeepStrings recursive]
    H -->|other| O[sanitizeSurrogates on String cast]
    C --> P[AiSdkMessage output]
    D --> P
    E --> P
    F --> P
    G --> P
    I --> P
    J --> P
    L --> P
    M --> P
    N --> P
    O --> P

Comments Outside Diff (1)

packages/shared/src/llms/ai-sdk-format.ts, line 275-286 (link)

When systemContent is an array of AiSdkMessagePart objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry text parts with arbitrary string content — is left unguarded.

Prompt To Fix With AI

This is a comment left during a code review.
Path: packages/shared/src/llms/ai-sdk-format.ts
Line: 275-286

Comment:
When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded.



How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/shared/src/llms/ai-sdk-format.ts:230-234
The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.

```suggestion
		if (images.length > 0) {
			const headerText =
				typeof stripped === "string"
					? sanitizeSurrogates(stripped)
					: JSON.stringify(sanitizeDeepStrings(stripped));
```

### Issue 2 of 2
packages/shared/src/llms/ai-sdk-format.ts:275-286
When `systemContent` is an array of `AiSdkMessagePart` objects, any string values inside those parts are forwarded to the provider without sanitization. The string path is covered, but the array path — which can carry `text` parts with arbitrary string content — is left unguarded.

```suggestion
	if (
		(typeof systemContent === "string" && systemContent.trim().length > 0) ||
		(Array.isArray(systemContent) && systemContent.length > 0)
	) {
		result.push({
			role: "system",
			content:
				typeof systemContent === "string"
					? sanitizeSurrogates(systemContent)
					: (sanitizeDeepStrings(systemContent) as AiSdkMessagePart[]),
		});
	}
```

_{Reviews (1): Last reviewed commit: "sanitize unicode surrogates" | Re-trigger Greptile}

greptile-apps · 2026-05-07T00:40:05Z

 		if (images.length > 0) {
 			const headerText =
-				typeof stripped === "string" ? stripped : JSON.stringify(stripped);
+				typeof stripped === "string"
+					? sanitizeSurrogates(stripped)
+					: sanitizeSurrogates(JSON.stringify(stripped));


The sanitization happens in the wrong order here. JSON.stringify(stripped) is called before sanitizeSurrogates, so if stripped is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a TypeError during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data.

Suggested change

if (images.length > 0) {

const headerText =

typeof stripped === "string" ? stripped : JSON.stringify(stripped);

typeof stripped === "string"

? sanitizeSurrogates(stripped)

: sanitizeSurrogates(JSON.stringify(stripped));

if (images.length > 0) {

const headerText =

typeof stripped === "string"

? sanitizeSurrogates(stripped)

: JSON.stringify(sanitizeDeepStrings(stripped));

Prompt To Fix With AI

This is a comment left during a code review. Path: packages/shared/src/llms/ai-sdk-format.ts Line: 230-234 Comment: The sanitization happens in the wrong order here. `JSON.stringify(stripped)` is called before `sanitizeSurrogates`, so if `stripped` is an object that contains lone-surrogate strings, V8 (Node.js 21+) will throw a `TypeError` during stringify — before any sanitization can occur. The fix is to deep-sanitize the object first, then stringify the already-clean data. ```suggestion if (images.length > 0) { const headerText = typeof stripped === "string" ? sanitizeSurrogates(stripped) : JSON.stringify(sanitizeDeepStrings(stripped)); ``` How can I resolve this? If you propose a fix, please make it concise.

@greptileai please review again

The fix looks correct now. The updated line:

: JSON.stringify(sanitizeDeepStrings(stripped));

calls sanitizeDeepStrings before JSON.stringify, so lone surrogates are replaced before V8 ever attempts serialization. My original concern no longer applies — this is the right order of operations.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

sanitize unicode surrogates

5de5126

jsimone requested a review from abeatrix May 7, 2026 00:35

greptile-apps Bot reviewed May 7, 2026

View reviewed changes

change ordering of sanitization

1eda627

abeatrix approved these changes May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitize Unicode Surrogates#34

Sanitize Unicode Surrogates#34
jsimone wants to merge 2 commits intomainfrom
sanitize-unicode-surrogates

jsimone commented May 7, 2026

Uh oh!

greptile-apps Bot commented May 7, 2026 •

edited

Loading

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

jsimone May 7, 2026

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jsimone commented May 7, 2026

Summary

What changed

Coverage

Verification

Uh oh!

greptile-apps Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

jsimone May 7, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 7, 2026 •

edited

Loading