Project file #3

Merged
tisnik merged 1 commit into lightspeed-core:main from tisnik:project-file
May 2, 2025
Conversation


@tisnik tisnik commented May 2, 2025

Description

Project file

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

@tisnik tisnik merged commit 0cddc66 into lightspeed-core:main May 2, 2025
tisnik pushed a commit that referenced this pull request Apr 30, 2026
Add the spike doc (decisions up front, background below, 4 proposed
JIRAs) and the spec doc (R1..R6 requirements, architecture, key files
and insertion points, known limitations) under docs/design/byok-pdf/.

The spike is lightweight by design: HTML support shipped under
LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern,
docling dependency, BaseReader plumbing, CLI shape, and test layout
are all already established. PDF support is a scaffold-and-mirror job
plus a one-line addition to document_processor.py's doc_type branches.

Decisions captured for confirmation (each with options table and
recommendation in the spike doc):

  D1: Library                 -- docling (already a dependency)
  D2: OCR for scanned PDFs    -- out of scope; track as follow-up
  D3: Repo placement          -- rag-content (impl) + lightspeed-stack
                                 (BYOK guide update only)
  D4: Pipeline knobs          -- hard-coded sensible defaults; no CLI
                                 flags in v1 (mirrors HTMLReader)
  D5: Chunking strategy       -- reuse MarkdownNodeParser; add "pdf" to
                                 document_processor.py:75 and :87
  D6: Code organization       -- new pdf/ package mirroring html/
  D7: Test coverage           -- unit/integration in JIRA #2, e2e in #3
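
As an illustration of D5, the doc_type change could look roughly like the
sketch below. The names MARKDOWN_DOC_TYPES and uses_markdown_parser are
hypothetical, and the assumption that "markdown" and "html" are the existing
entries is mine; the real branches in document_processor.py are not shown in
this commit message and may be shaped differently.

```python
# Hypothetical sketch only: the actual doc_type branches live in
# document_processor.py and are not reproduced here.

# Assumed set of doc types whose converted output is Markdown and can
# therefore be chunked by the Markdown parser. "pdf" is the new entry.
MARKDOWN_DOC_TYPES = frozenset({"markdown", "html", "pdf"})


def uses_markdown_parser(doc_type: str) -> bool:
    """Return True when doc_type should be chunked by the Markdown parser."""
    return doc_type.strip().lower() in MARKDOWN_DOC_TYPES
```

Keeping the membership test in one place would mean that a future format
only needs one new set entry, rather than an edit at each branch.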

Four sub-JIRAs proposed under LCORE-1471 (parseable by
dev-tools/file-jiras.sh):

  1. rag-content: Implement PDF support
  2. rag-content: Unit and integration tests
  3. rag-content: End-to-end test (PDF -> vector store -> stack query)
  4. lightspeed-stack: Update BYOK guide for native PDF support

PoC evidence under poc-results/:

  01-poc-report.txt       Methodology, findings, implications
  02-conversion-log.txt   Exact commands and timings
  03-sample-jira-1311.md  Clean conversion (Atlassian Cloud PDF)
  04-sample-jira-836.md   Body clean, headings degraded
                          (Confluence PDF, letter-spaced display font)

Honest PoC findings worth surfacing:

- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF")
  produce noisy heading text; documented as a v1 known limitation.
- Cold model load takes ~5 minutes on CPU; warm conversions take
  ~30-90 s for small/medium PDFs. This is acceptable for offline indexing.
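
The cold/warm gap above suggests building the converter once and reusing it
across files. The sketch below shows one way to do that; the PdfToMarkdown
class is hypothetical and is not the reader proposed in the spike. It assumes
docling's DocumentConverter with default settings and that the model-load
cost is paid once per converter instance. The converter can be injected so
the wrapper is testable without docling installed.

```python
import time
from typing import Any, Optional


class PdfToMarkdown:
    """Convert PDFs to Markdown, creating the docling converter only once."""

    def __init__(self, converter: Optional[Any] = None) -> None:
        # An injected converter (e.g. a test double) is used as-is.
        self._converter = converter

    def _get_converter(self) -> Any:
        if self._converter is None:
            # Imported lazily so that importing this module stays cheap.
            from docling.document_converter import DocumentConverter

            self._converter = DocumentConverter()
        return self._converter

    def convert(self, path: str) -> tuple[str, float]:
        """Return (markdown_text, elapsed_seconds) for one PDF."""
        start = time.perf_counter()
        result = self._get_converter().convert(path)
        markdown = result.document.export_to_markdown()
        return markdown, time.perf_counter() - start
```

A batch indexer would create one PdfToMarkdown instance and call convert()
for every file, so only the first call should see the cold-start cost.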

Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be
removed before merge; spike doc and spec doc remain in the repo.