Project file by tisnik · Pull Request #3 · lightspeed-core/lightspeed-stack

tisnik · 2025-05-02T07:54:44Z

Description

Project file

Type of change

Add the spike doc (decisions up front, background below, 4 proposed JIRAs) and the spec doc (R1..R6 requirements, architecture, key files and insertion points, known limitations) under docs/design/byok-pdf/.

The spike is lightweight by design: HTML support shipped under LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern, docling dependency, BaseReader plumbing, CLI shape, and test layout are all already established. PDF support is a scaffold-and-mirror job plus a one-line addition to document_processor.py's doc_type branches.

Decisions captured for confirmation (each with options table and recommendation in the spike doc):

- D1: Library -- docling (already a dependency)
- D2: OCR for scanned PDFs -- out of scope; track as follow-up
- D3: Repo placement -- rag-content (impl) + lightspeed-stack (BYOK guide update only)
- D4: Pipeline knobs -- hard-coded sensible defaults; no CLI flags in v1 (mirrors HTMLReader)
- D5: Chunking strategy -- reuse MarkdownNodeParser; add "pdf" to document_processor.py:75 and :87
- D6: Code organization -- new pdf/ package mirroring html/
- D7: Test coverage -- unit/integration in JIRA #2, e2e in #3

Four sub-JIRAs proposed under LCORE-1471 (parseable by dev-tools/file-jiras.sh):

1. rag-content: Implement PDF support
2. rag-content: Unit and integration tests
3. rag-content: End-to-end test (PDF -> vector store -> stack query)
4. lightspeed-stack: Update BYOK guide for native PDF support

PoC evidence under poc-results/:

- 01-poc-report.txt: Methodology, findings, implications
- 02-conversion-log.txt: Exact commands and timings
- 03-sample-jira-1311.md: Clean conversion (Atlassian Cloud PDF)
- 04-sample-jira-836.md: Body clean, headings degraded (Confluence PDF, letter-spaced display font)

Honest PoC findings worth surfacing:

- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF") produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for small/medium PDFs. Acceptable for offline indexing.

Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be removed before merge; spike doc and spec doc remain in the repo.
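
For readers unfamiliar with the pattern behind D1 and D5, here is a minimal sketch of the idea: convert the PDF to Markdown with docling, then route it through the same Markdown chunking path as HTML. The names `MARKDOWN_LIKE_DOC_TYPES`, `uses_markdown_chunking`, and `pdf_to_markdown` are hypothetical and are not taken from rag-content; the actual branches at document_processor.py:75 and :87 may look quite different. The docling call is assumed to use default pipeline options, in the spirit of D4.

```python
# Hypothetical sketch only: these names are illustrative, not rag-content code.

# Doc types whose content ends up as Markdown and can therefore share
# one Markdown-aware chunker (the description names MarkdownNodeParser).
MARKDOWN_LIKE_DOC_TYPES = frozenset({"markdown", "html", "pdf"})


def uses_markdown_chunking(doc_type: str) -> bool:
    """Return True when doc_type should go through the Markdown chunker."""
    return doc_type.lower() in MARKDOWN_LIKE_DOC_TYPES


def pdf_to_markdown(path: str) -> str:
    """Convert a PDF file to Markdown text using docling's default pipeline."""
    # Imported lazily so the dispatch helper above works without docling.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()
```

Under this sketch, supporting a new Markdown-convertible format amounts to adding one entry to the set, which is roughly what "a one-line addition" to the doc_type branches suggests.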

Project file

90721aa

tisnik merged commit 0cddc66 into lightspeed-core:main May 2, 2025

max-svistunov mentioned this pull request Apr 27, 2026

LCORE-1471 spike: BYOK PDF support #1598

Merged

5 tasks

tisnik merged 1 commit into lightspeed-core:main from tisnik:project-file
