
Commit ba31be1

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans (#639)
* perf(pdf-server): lazy form extraction via range transport + incremental viewer scans

  Server: display_pdf now opens the document via PDFDataRangeTransport (disableAutoFetch) and only runs the per-page form/annotation walk when getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25% of the bytes instead of a full download. The unused viewFieldInfo Map is removed.

  Viewer: getDocument sets disableAutoFetch/disableStream; the baseline annotation scan and field-name mapping run lazily per rendered page instead of walking every page after load, so first paint no longer schedules a full-file pull.

  E2E: new range-counting HTTPS fixture (W-9 for forms, a generated text+image PDF for no-forms) with stallAfterBytes control, and four regression tests asserting that form fields are returned, that <30% of bytes are served on a no-forms display_pdf, that the first page renders while later ranges are stalled, and that overlap stays bounded.

* fix(pdf-server): handle separated field/widget trees in extractFormSchema

  pdfjs getFieldObjects() returns the full field-tree array. For PDFs with a separated structure (pdf-lib, some authoring tools) the typed widget sits at fields[1+] behind a typeless container at fields[0]; the previous code only inspected fields[0] and skipped them all. Pick the first entry with a non-empty type instead. Also makes the e2e forms.pdf fixture fully generated (no checked-in third-party asset on the hot path); fw9.pdf stays as a unit-test fixture for the hierarchical/XFA case.

* fix(pdf-server): address range-transport hangs and lazy-scan tombstone loss

  PdfCacheRangeTransport:
  - abort() is a no-op stub on PDFDataRangeTransport (it is the hook pdfjs calls *on* the transport, not an upstream error channel). Expose a .failed promise that rejects on the first fetch error and race every pdfjs await against it in display_pdf, so transient network errors surface into the existing catch instead of hanging the tool call.
  - pdfjs coalesces adjacent missing chunks into one unbounded requestDataRange, while readPdfRange clamps each call to MAX_CHUNK_BYTES. Loop and deliver in slices so every requested chunk is marked loaded.

  Viewer (mcp-app.ts):
  - The lazy per-page baseline scan left pdfBaselineAnnotations incomplete, so persistAnnotations and getAnnotatedPdfBytes silently dropped restoredRemovedIds tombstones for unvisited pages. Union those ids into the computed diff and removedRefs.

  Test fixture: release stalled handlers before resetStats/close so a failing stalled test doesn't hang afterAll; fail fast if started with NODE_ENV=production. The NODE_TLS_REJECT_UNAUTHORIZED scope is documented (full per-process scoping needs a validateUrl localhost allow, tracked separately).

* test(pdf-server): bridge coverage gaps and switch the fixture to gated loopback HTTP

  PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the original begin and removed after one delivery, so accumulate slices and call onDataRange once with the full buffer (the previous multi-call approach threw inside pdfjs). Covered by a new integration test that drives getDocument()/getPage(1) on a >1MB PDF through a clamping in-memory readPdfRange.

  validateUrl: allow http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made to probe its own ports. Covered by env-on/off unit tests. The fixture switches to plain HTTP (no openssl, no NODE_TLS_REJECT_UNAUTHORIZED) and adds /error.pdf (500s after 50KB) plus two e2e tests: page 2 renders after stall release (the >512KB object path), and display_pdf returns gracefully on a mid-load 500. The existing <30% test now samples stats before the iframe loads.

  New unit tests: display_pdf returns (not hangs) on mid-load fetch failure via an in-memory MCP client; computeDiff/serializeDiff contract tests pinning the restoredRemovedIds tombstone-preservation behaviour.
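The tombstone-preservation contract pinned by those computeDiff/serializeDiff tests can be sketched in isolation; `removedIdsForPersist` and its parameters are hypothetical names for illustration, not the viewer's actual API:

```typescript
// Sketch of the tombstone rule (names hypothetical): while the lazy baseline
// scan has not yet visited every page, ids the user already deleted
// (restoredRemovedIds) must be unioned into the computed removal diff, or
// persisting from a partially scanned document silently resurrects them.
// Once the scan is complete, the diff alone is authoritative, so a stale id
// can no longer pin the document dirty.
function removedIdsForPersist(
  diffRemovedIds: ReadonlySet<string>,
  restoredRemovedIds: Iterable<string>,
  baselineScanComplete: boolean,
): Set<string> {
  const out = new Set(diffRemovedIds);
  if (!baselineScanComplete) {
    for (const id of restoredRemovedIds) out.add(id);
  }
  return out;
}
```

The `baselineScanComplete` flag is what keeps the two commits consistent: carry tombstones forward only while pages remain unscanned, drop them once every page has been seen.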
* test(pdf-server): e2e for restoredRemovedIds tombstone preservation through the viewer

  Adds a /with-native-annot.pdf fixture (2 pages, native /Text annot on page 2) and a Playwright test that drives delete on page 2 → iframe reload → interact add_annotations on page 1 (persist with page 2 unscanned) → assert the localStorage diff still contains the removed id → page 2 still shows it gone. Covers the mcp-app.ts glue the unit tests can't reach.

* refactor(pdf-server): simplify per round-2 review and fix a false-pass test

  server.ts: extract probeFormFields() so display_pdf calls one helper instead of inlining 40 lines of transport/orFail plumbing; make extractFormSchema's fieldObjects param required (the optional branch only existed for a test helper).

  server.test.ts: the >1MB integration test now forces the image XObject fetch via getOperatorList() and asserts that the max requestDataRange span exceeds MAX_CHUNK_BYTES, so it can't go vacuous. Dedupe makeRandomJpeg by importing it from the fixture.

  mcp-app.ts: restore per-annotation try/catch isolation in the lazy baseline scan (a throw was skipping AnnotationLayer.render); only carry forward restoredRemovedIds while the baseline scan is incomplete, so a stale id can't pin dirty=true once every page is scanned.

  tests: bump the fixture JPEG to 1.1MB; consolidate 7 e2e tests down to 4 (merge overlap into the byte-budget test, merge stall + page-2, drop the error-e2e duplicate of the unit integration test); delete run-fixture.mjs; drop /error.pdf and ERROR_AFTER_BYTES; rename the tombstone test's describe and clarify that page-2 covers the viewer transport.

* fix(pdf-server): preserve partial probeFormFields results when extractFormFieldInfo throws

* test(pdf-server): fix tombstone e2e — deleted native annotations become a cleared card, not removed from DOM

* test(pdf-server): mark tombstone e2e as fixme — needs basic-host iframe-reload replay support

* test(pdf-server): link tombstone fixme to tracking issue #642
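The `.failed` race described in the commits above can be sketched in isolation; `makeFailable` and `orFail` are hypothetical names for illustration, not the pdf-server's actual exports:

```typescript
// Minimal sketch of the ".failed" pattern: race every long pdfjs await
// against a promise that rejects on the first transport error, so a dead
// network rejects the await instead of leaving the tool call hanging.
// makeFailable/orFail are illustrative names, not the real API.
function makeFailable() {
  let fail!: (err: Error) => void;
  const failed = new Promise<never>((_, reject) => {
    fail = reject;
  });
  // Attach a no-op handler so an error raised when nothing is racing
  // against `failed` does not become an unhandled rejection.
  failed.catch(() => {});
  const orFail = <T>(p: Promise<T>): Promise<T> => Promise.race([p, failed]);
  return { fail, orFail };
}
```

A fetch wrapper would call `fail(err)` in its catch; every `await orFail(somePdfjsPromise)` then rejects with that same error and falls into the tool's existing error path rather than hanging.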
1 parent 30a78b6 commit ba31be1

8 files changed

Lines changed: 1133 additions & 106 deletions

File tree

examples/pdf-server/server.test.ts

Lines changed: 266 additions & 0 deletions
@@ -4,9 +4,15 @@ import os from "node:os";
 import path from "node:path";
 import { Client } from "@modelcontextprotocol/sdk/client/index.js";
 import { InMemoryTransport } from "@modelcontextprotocol/sdk/inMemory.js";
+import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs";
+import { PDFDocument } from "pdf-lib";
+import { makeRandomJpeg } from "../../tests/helpers/range-counting-server";
 import {
   createPdfCache,
   createServer,
+  extractFormSchema,
+  PdfCacheRangeTransport,
+  MAX_CHUNK_BYTES,
   validateUrl,
   isAncestorDir,
   allowedLocalFiles,
@@ -289,6 +295,266 @@ describe("PDF Cache with Timeouts", () => {
 // through manual testing or E2E tests.
 });

+describe("PdfCacheRangeTransport", () => {
+  it("accumulates ranges larger than MAX_CHUNK_BYTES into one onDataRange call", async () => {
+    const big = MAX_CHUNK_BYTES * 2 + 100;
+    const reads: Array<[number, number]> = [];
+    const t = new PdfCacheRangeTransport("u", big, async (_u, off, n) => {
+      reads.push([off, n]);
+      return {
+        data: new Uint8Array(Math.min(n, MAX_CHUNK_BYTES)),
+        totalBytes: big,
+      };
+    });
+    const delivered: Array<[number, number]> = [];
+    t.addRangeListener((begin: number, chunk: Uint8Array) =>
+      delivered.push([begin, chunk.length]),
+    );
+    t.requestDataRange(0, big);
+    await new Promise((r) => setTimeout(r, 10));
+    // pdf.js's reader is keyed by the original begin and removed after one
+    // delivery, so deliver() must call onDataRange exactly once with the
+    // accumulated buffer — multiple calls would throw inside pdfjs.
+    expect(delivered).toEqual([[0, big]]);
+    expect(reads).toEqual([
+      [0, MAX_CHUNK_BYTES],
+      [MAX_CHUNK_BYTES, MAX_CHUNK_BYTES],
+      [MAX_CHUNK_BYTES * 2, 100],
+    ]);
+  });
+
+  it("rejects .failed when a range fetch errors instead of hanging", async () => {
+    const t = new PdfCacheRangeTransport("u", 1000, async () => {
+      throw new Error("network down");
+    });
+    t.requestDataRange(0, 100);
+    await expect(
+      Promise.race([
+        t.failed,
+        new Promise((r) => setTimeout(() => r("timeout"), 200)),
+      ]),
+    ).rejects.toThrow("network down");
+  });
+
+  it("rejects .failed on zero-length response (would otherwise spin)", async () => {
+    const t = new PdfCacheRangeTransport("u", 1000, async () => ({
+      data: new Uint8Array(0),
+      totalBytes: 1000,
+    }));
+    t.requestDataRange(0, 100);
+    await expect(t.failed).rejects.toThrow(/empty range/);
+  });
+
+  it("getDocument resolves on a >1MB PDF when readPdfRange clamps to MAX_CHUNK_BYTES", async () => {
+    // pdfjs coalesces adjacent missing chunks into one requestDataRange that
+    // can exceed MAX_CHUNK_BYTES. deliver() must accumulate clamped reads and
+    // hand pdfjs a single onDataRange(begin, fullBuffer). This test fails if
+    // deliver() either truncates or calls onDataRange more than once per
+    // requestDataRange (pdf.mjs _onReceiveData matches by exact begin).
+    const d = await PDFDocument.create();
+    const img = await d.embedJpg(makeRandomJpeg(1_100_000));
+    const page = d.addPage([612, 792]);
+    page.drawImage(img, { x: 36, y: 36, width: 540, height: 720 });
+    const bytes = await d.save();
+    expect(bytes.length).toBeGreaterThan(2 * MAX_CHUNK_BYTES);
+
+    const readClamped: PdfCache["readPdfRange"] = async (_u, off, n) => {
+      const len = Math.min(n, MAX_CHUNK_BYTES, bytes.length - off);
+      return { data: bytes.slice(off, off + len), totalBytes: bytes.length };
+    };
+    // Record the spans pdfjs actually requests so the test fails fast if it
+    // never asks for >MAX_CHUNK_BYTES (i.e. can't go vacuously green).
+    const spans: number[] = [];
+    class RecordingTransport extends PdfCacheRangeTransport {
+      override requestDataRange(begin: number, end: number): void {
+        spans.push(end - begin);
+        super.requestDataRange(begin, end);
+      }
+    }
+    const transport = new RecordingTransport(
+      "mem://big",
+      bytes.length,
+      readClamped,
+    );
+
+    const orHang = <T>(p: Promise<T>, what: string): Promise<T> =>
+      Promise.race([
+        p,
+        transport.failed,
+        new Promise<never>((_, rej) =>
+          setTimeout(() => rej(new Error(`${what} hung`)), 5000),
+        ),
+      ]);
+
+    const doc = await orHang(
+      getDocument({
+        range: transport,
+        length: bytes.length,
+        disableAutoFetch: true,
+        disableStream: true,
+        rangeChunkSize: 64 * 1024,
+      }).promise,
+      "getDocument",
+    );
+    const p1 = await orHang(doc.getPage(1), "getPage");
+    // getPage() alone doesn't decode the image XObject; getOperatorList() does,
+    // which is what triggers the >512KB coalesced range request.
+    await orHang(p1.getOperatorList(), "getOperatorList");
+    expect(Math.max(...spans)).toBeGreaterThan(MAX_CHUNK_BYTES);
+    doc.destroy();
+  });
+});
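The clamp-and-accumulate behaviour these tests pin down can be sketched in isolation. `readFullRange` and its `readRange` parameter are illustrative stand-ins for `deliver()`/`readPdfRange`, not the server's actual code:

```typescript
// Rough sketch of the clamp-and-accumulate loop (MAX_CHUNK_BYTES value
// assumed): pdf.js may coalesce missing chunks into one range request larger
// than a single fetch is allowed to return, so loop in capped slices and
// hand the caller ONE contiguous buffer; pdf.js removes its range reader
// after the first delivery for a given begin offset.
const MAX_CHUNK_BYTES = 512 * 1024;

async function readFullRange(
  readRange: (offset: number, length: number) => Promise<Uint8Array>,
  begin: number,
  end: number,
): Promise<Uint8Array> {
  const out = new Uint8Array(end - begin);
  let offset = begin;
  while (offset < end) {
    const want = Math.min(end - offset, MAX_CHUNK_BYTES);
    const chunk = await readRange(offset, want);
    if (chunk.length === 0) {
      throw new Error("empty range response"); // would otherwise loop forever
    }
    const take = Math.min(chunk.length, end - offset);
    out.set(chunk.subarray(0, take), offset - begin);
    offset += take;
  }
  return out; // deliver exactly once, e.g. onDataRange(begin, out)
}
```

The single returned buffer is what keeps pdf.js happy: its reader is matched by the exact `begin` offset and discarded after one delivery, so partial or repeated deliveries break the lookup.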
+
+describe("display_pdf transport-error handling", () => {
+  it("returns (does not hang) when range fetches fail mid-load", async () => {
+    // First fetch = the 1-byte size probe → 206 with Content-Range so
+    // display_pdf gets totalBytes. Every subsequent fetch (made by
+    // PdfCacheRangeTransport via readPdfRange) rejects, which must surface
+    // through transport.failed → orFail() → outer catch, not hang.
+    let calls = 0;
+    const mockFetch = spyOn(globalThis, "fetch").mockImplementation(
+      async () => {
+        if (calls++ === 0) {
+          return new Response(new Uint8Array(1), {
+            status: 206,
+            headers: { "Content-Range": "bytes 0-0/50000" },
+          });
+        }
+        throw new Error("network down");
+      },
+    );
+
+    const server = createServer();
+    const client = new Client({ name: "t", version: "1" });
+    const [ct, st] = InMemoryTransport.createLinkedPair();
+    await Promise.all([server.connect(st), client.connect(ct)]);
+
+    try {
+      const result = await Promise.race([
+        client.callTool({
+          name: "display_pdf",
+          arguments: { url: "https://arxiv.org/pdf/err-test" },
+        }),
+        new Promise<never>((_, rej) =>
+          setTimeout(
+            () => rej(new Error("display_pdf hung on transport error")),
+            3000,
+          ),
+        ),
+      ]);
+      expect(result.isError).toBeFalsy();
+      const sc = result.structuredContent as { formFields?: unknown };
+      expect(sc.formFields).toBeUndefined();
+      expect(calls).toBeGreaterThan(1);
+    } finally {
+      mockFetch.mockRestore();
+      await client.close();
+      await server.close();
+    }
+  });
+});
+
+describe("extractFormSchema field-tree handling", () => {
+  async function schemaFor(bytes: Uint8Array) {
+    const doc = await getDocument({ data: bytes }).promise;
+    try {
+      const fo = (await doc.getFieldObjects()) as Parameters<
+        typeof extractFormSchema
+      >[1];
+      return await extractFormSchema(doc, fo);
+    } finally {
+      doc.destroy();
+    }
+  }
+
+  it("handles pdf-lib separated field/widget structure", async () => {
+    const d = await PDFDocument.create();
+    const form = d.getForm();
+    d.addPage([612, 792]);
+    form
+      .createTextField("alpha")
+      .addToPage(d.getPage(0), { x: 50, y: 700, width: 200, height: 20 });
+    form
+      .createCheckBox("agree")
+      .addToPage(d.getPage(0), { x: 50, y: 660, width: 20, height: 20 });
+    form
+      .createDropdown("choice")
+      .addToPage(d.getPage(0), { x: 50, y: 620, width: 100, height: 20 });
+
+    const schema = await schemaFor(await d.save());
+    expect(schema).not.toBeNull();
+    expect(schema!.properties.alpha).toEqual({
+      type: "string",
+      title: "alpha",
+    });
+    expect(schema!.properties.agree).toEqual({
+      type: "boolean",
+      title: "agree",
+    });
+    expect(schema!.properties.choice.type).toBe("string");
+  });
+
+  it("handles fields with multiple widgets across pages", async () => {
+    const d = await PDFDocument.create();
+    const form = d.getForm();
+    d.addPage([612, 792]);
+    d.addPage([612, 792]);
+    const tf = form.createTextField("shared");
+    tf.addToPage(d.getPage(0), { x: 50, y: 700, width: 200, height: 20 });
+    tf.addToPage(d.getPage(1), { x: 50, y: 700, width: 200, height: 20 });
+
+    const schema = await schemaFor(await d.save());
+    expect(schema?.properties.shared).toEqual({
+      type: "string",
+      title: "shared",
+    });
+  });
+
+  it("skips container nodes and finds leaf fields (W-9 style)", async () => {
+    const bytes = fs.readFileSync(
+      path.join(__dirname, "../../tests/helpers/assets/fw9.pdf"),
+    );
+    const doc = await getDocument({ data: new Uint8Array(bytes) }).promise;
+    try {
+      const fo = (await doc.getFieldObjects()) as Parameters<
+        typeof extractFormSchema
+      >[1];
+      // Container nodes (no leaf type) should not crash extraction
+      expect(fo!["topmostSubform[0]"]).toBeDefined();
+      // Schema is null for W-9 (mechanical names), but extraction must not throw
+      const schema = await extractFormSchema(doc, fo);
+      expect(schema).toBeNull();
+    } finally {
+      doc.destroy();
+    }
+  });
+
+  it("returns null when no AcroForm present", async () => {
+    const d = await PDFDocument.create();
+    d.addPage([612, 792]);
+    const schema = await schemaFor(await d.save());
+    expect(schema).toBeNull();
+  });
+});
+
+describe("validateUrl loopback HTTP allow (PDF_SERVER_ALLOW_LOOPBACK_HTTP)", () => {
+  it("rejects http://127.0.0.1 by default", () => {
+    expect(validateUrl("http://127.0.0.1:9999/x.pdf").valid).toBe(false);
+  });
+
+  it("accepts http://127.0.0.1 only when the env gate is set, and never non-loopback http", () => {
+    const prev = process.env.PDF_SERVER_ALLOW_LOOPBACK_HTTP;
+    process.env.PDF_SERVER_ALLOW_LOOPBACK_HTTP = "1";
+    try {
+      expect(validateUrl("http://127.0.0.1:9999/x.pdf").valid).toBe(true);
+      expect(validateUrl("http://169.254.169.254/").valid).toBe(false);
+    } finally {
+      if (prev === undefined) delete process.env.PDF_SERVER_ALLOW_LOOPBACK_HTTP;
+      else process.env.PDF_SERVER_ALLOW_LOOPBACK_HTTP = prev;
+    }
+  });
+});
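A minimal version of the gate these tests exercise might look like the following; `isAllowedPdfUrl` is a hypothetical stand-in for `validateUrl`, which also handles local files and MCP roots:

```typescript
// Hypothetical stand-in for the loopback gate: https always passes the
// scheme check, plain http passes only for loopback hosts and only when
// PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy cannot be
// steered at its own ports (or at link-local metadata endpoints).
// Note: the WHATWG URL hostname keeps the brackets for IPv6 hosts.
const LOOPBACK_HOSTS = new Set(["127.0.0.1", "localhost", "[::1]"]);

function isAllowedPdfUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparseable input is rejected outright
  }
  if (url.protocol === "https:") return true;
  if (url.protocol !== "http:") return false;
  return (
    process.env.PDF_SERVER_ALLOW_LOOPBACK_HTTP !== undefined &&
    LOOPBACK_HOSTS.has(url.hostname)
  );
}
```

Checking the parsed `url.hostname` (rather than substring-matching the raw string) is what keeps `http://169.254.169.254/` and lookalike hosts out even when the gate is on.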
+
 describe("validateUrl with MCP roots (allowedLocalDirs)", () => {
   const savedFiles = new Set(allowedLocalFiles);
   const savedDirs = new Set(allowedLocalDirs);
