GitHub templates #2

Merged
tisnik merged 3 commits into lightspeed-core:main from tisnik:github-templates on May 1, 2025

tisnik commented on May 1, 2025

No description provided.

tisnik merged commit 94b3192 into lightspeed-core:main on May 1, 2025
radofuchs referenced this pull request in radofuchs/lightspeed-stack Mar 27, 2026
# This is the 1st commit message:

LCORE-1270: Added e2e tests for responses endpoint

# This is the commit message #2:

fix
tisnik pushed a commit that referenced this pull request Apr 30, 2026
Add the spike doc (decisions up front, background below, 4 proposed
JIRAs) and the spec doc (R1..R6 requirements, architecture, key files
and insertion points, known limitations) under docs/design/byok-pdf/.

The spike is lightweight by design: HTML support shipped under
LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern,
docling dependency, BaseReader plumbing, CLI shape, and test layout
are all already established. PDF support is a scaffold-and-mirror job
plus a one-line addition to document_processor.py's doc_type branches.

Decisions captured for confirmation (each with options table and
recommendation in the spike doc):

  D1: Library                -- docling (already a dependency)
  D2: OCR for scanned PDFs   -- out of scope; track as follow-up
  D3: Repo placement          -- rag-content (impl) + lightspeed-stack
                                  (BYOK guide update only)
  D4: Pipeline knobs          -- hard-coded sensible defaults; no CLI
                                  flags in v1 (mirrors HTMLReader)
  D5: Chunking strategy       -- reuse MarkdownNodeParser; add "pdf" to
                                  document_processor.py:75 and :87
  D6: Code organization       -- new pdf/ package mirroring html/
  D7: Test coverage           -- unit/integration in JIRA #2, e2e in #3

Four sub-JIRAs proposed under LCORE-1471 (parseable by
dev-tools/file-jiras.sh):

  1. rag-content: Implement PDF support
  2. rag-content: Unit and integration tests
  3. rag-content: End-to-end test (PDF -> vector store -> stack query)
  4. lightspeed-stack: Update BYOK guide for native PDF support

PoC evidence under poc-results/:

  01-poc-report.txt    Methodology, findings, implications
  02-conversion-log.txt  Exact commands and timings
  03-sample-jira-1311.md  Clean conversion (Atlassian Cloud PDF)
  04-sample-jira-836.md   Body clean, headings degraded
                          (Confluence PDF, letter-spaced display font)

Honest PoC findings worth surfacing:

- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF")
  produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for
  small/medium PDFs. Acceptable for offline indexing.

Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be
removed before merge; spike doc and spec doc remain in the repo.
tisnik pushed a commit that referenced this pull request Apr 30, 2026
Apply CodeRabbit's actionable comments and the per-comment nits:

1. PoC results section in spike doc previously listed paths under
   poc-results/ that are deleted before merge per howto-run-a-spike.md
   step 10, leaving broken links in the merged document. Replace the
   file list with a self-contained summary of what the PoC proved
   plus the heading-degradation finding, and a note pointing future
   readers at the PR diff if the raw artifacts are ever needed.

2. Drop the reference to docs/local-stack-testing.md (a local-only
   file, never committed to the repo).

3. Replace fragile line-numbered references (document_processor.py:75,
   :87, byok_guide.md ~106-118) with stable symbol anchors:
   _BaseDB.__init__, _LlamaStackDB.__init__, "Knowledge Sources"
   subsection, "Step 1" subsection. Line numbers rot; section names
   and symbol names rot less.

4. Spec doc now instructs the implementation ticket to extract the
   ("markdown", "html", "pdf") predicate to a single
   MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] constant in
   document_processor.py and reference it from both call sites,
   instead of duplicating the tuple. JIRA #1 scope updated to match.

5. Add R7: PDFReader.load_data emits a logger.warning when its docling
   output is empty / under a small threshold (a likely indicator of a
   scanned PDF given R5's no-OCR scope). Threshold is a module-level
   Final[int] constant. JIRA #1 scope and JIRA #2 test patterns
   updated to require coverage via caplog. Surfacing the silent-
   degradation case in custom_processor.py logs costs nothing and
   makes the OCR-needed signal visible.
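
Items 4 and 5 above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from rag-content: the constant name MARKDOWN_COMPATIBLE_DOC_TYPES and its type come from the commit message, while the helper names (uses_markdown_parser, warn_if_suspiciously_empty), the threshold name MIN_EXPECTED_CHARS, and its value of 50 are hypothetical, since the message only says "a small threshold".

```python
import logging
from typing import Final

logger = logging.getLogger(__name__)

# Single source of truth for doc types whose converted output is Markdown,
# so both call sites reference one tuple instead of duplicating it.
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf")

# Hypothetical name and value: the commit message specifies only a
# module-level Final[int] "small threshold".
MIN_EXPECTED_CHARS: Final[int] = 50


def uses_markdown_parser(doc_type: str) -> bool:
    """Return True when doc_type should be chunked by the Markdown parser."""
    return doc_type in MARKDOWN_COMPATIBLE_DOC_TYPES


def warn_if_suspiciously_empty(markdown_text: str, source: str) -> bool:
    """Log a warning when converted text is empty or very short.

    With OCR out of scope, near-empty output is a likely sign of a
    scanned PDF. Returns True when a warning was emitted.
    """
    char_count = len(markdown_text.strip())
    if char_count < MIN_EXPECTED_CHARS:
        logger.warning(
            "PDF %s produced only %d characters of text; "
            "it may be a scanned document that needs OCR",
            source,
            char_count,
        )
        return True
    return False
```

In a pytest test, the warning path can be asserted with the caplog fixture, for example by calling warn_if_suspiciously_empty("", "scan.pdf") and checking that caplog.records contains one WARNING entry.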

Plus the two reviewer nits worth carrying into JIRA #1:

- Use docling's TableFormerMode.ACCURATE enum, not the string literal
  "accurate"; both work via Pydantic coercion but the enum is
  type-checked.
- Mirror HTMLReader's choice on whether to call super().__init__();
  llama-index's BaseReader does not require it but symmetry between
  the two readers is preferred.
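
The enum-versus-string nit can be illustrated with a standard-library stand-in. This is only a sketch: TableMode and configure_tables are hypothetical names and are not docling's API. They mimic a string-valued enum and the coercion step a validating model would perform.

```python
from enum import Enum


class TableMode(str, Enum):
    """Hypothetical stand-in for a string-valued mode enum (not docling's class)."""

    FAST = "fast"
    ACCURATE = "accurate"


def configure_tables(mode: TableMode) -> TableMode:
    """Normalise the argument to the enum, mimicking validator-style coercion.

    A raw string such as "accurate" is accepted at runtime, but a static
    type checker would flag that call because the annotation asks for TableMode.
    """
    return TableMode(mode)


# Both calls yield the same member at runtime; only the first type-checks.
from_enum = configure_tables(TableMode.ACCURATE)
from_string = configure_tables("accurate")  # type: ignore[arg-type]
```

A misspelled literal such as "acurate" fails only at runtime with a ValueError, whereas a misspelled enum member is caught by the type checker before the code runs.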

The spec doc changelog records this revision and its trigger (the
PR #1598 CodeRabbit review).
tisnik added a commit that referenced this pull request May 13, 2026