GitHub templates#2
Merged
radofuchs referenced this pull request in radofuchs/lightspeed-stack on Mar 27, 2026
# This is the 1st commit message: LCORE-1270: Added e2e tests for responses endpoint # This is the commit message #2: fix
tisnik pushed a commit that referenced this pull request on Apr 30, 2026
Add the spike doc (decisions up front, background below, 4 proposed
JIRAs) and the spec doc (R1..R6 requirements, architecture, key files
and insertion points, known limitations) under docs/design/byok-pdf/.
The spike is lightweight by design: HTML support shipped under
LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern,
docling dependency, BaseReader plumbing, CLI shape, and test layout
are all already established. PDF support is a scaffold-and-mirror job
plus a one-line addition to document_processor.py's doc_type branches.
Decisions captured for confirmation (each with options table and
recommendation in the spike doc):
D1: Library -- docling (already a dependency)
D2: OCR for scanned PDFs -- out of scope; track as follow-up
D3: Repo placement -- rag-content (impl) + lightspeed-stack (BYOK guide update only)
D4: Pipeline knobs -- hard-coded sensible defaults; no CLI flags in v1 (mirrors HTMLReader)
D5: Chunking strategy -- reuse MarkdownNodeParser; add "pdf" to document_processor.py:75 and :87
D6: Code organization -- new pdf/ package mirroring html/
D7: Test coverage -- unit/integration in JIRA #2, e2e in #3
Four sub-JIRAs proposed under LCORE-1471 (parseable by
dev-tools/file-jiras.sh):
1. rag-content: Implement PDF support
2. rag-content: Unit and integration tests
3. rag-content: End-to-end test (PDF -> vector store -> stack query)
4. lightspeed-stack: Update BYOK guide for native PDF support
PoC evidence under poc-results/:
01-poc-report.txt        Methodology, findings, implications
02-conversion-log.txt    Exact commands and timings
03-sample-jira-1311.md   Clean conversion (Atlassian Cloud PDF)
04-sample-jira-836.md    Body clean, headings degraded (Confluence PDF, letter-spaced display font)
Honest PoC findings worth surfacing:
- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF")
  produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for
  small/medium PDFs. Acceptable for offline indexing.
Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be
removed before merge; spike doc and spec doc remain in the repo.
tisnik pushed a commit that referenced this pull request on Apr 30, 2026
Apply CodeRabbit's actionable comments and the per-comment nits:
1. PoC results section in spike doc previously listed paths under
poc-results/ that are deleted before merge per howto-run-a-spike.md
step 10, leaving broken links in the merged document. Replace the
file list with a self-contained summary of what the PoC proved
plus the heading-degradation finding, and a note pointing future
readers at the PR diff if the raw artifacts are ever needed.
2. Drop the reference to docs/local-stack-testing.md (a local-only
file, never committed to the repo).
3. Replace fragile line-numbered references (document_processor.py:75,
:87, byok_guide.md ~106-118) with stable symbol anchors:
_BaseDB.__init__, _LlamaStackDB.__init__, "Knowledge Sources"
subsection, "Step 1" subsection. Line numbers rot; section names
and symbol names rot less.
4. Spec doc now instructs the implementation ticket to extract the
("markdown", "html", "pdf") predicate to a single
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] constant in
document_processor.py and reference it from both call sites,
instead of duplicating the tuple. JIRA #1 scope updated to match.
5. Add R7: PDFReader.load_data emits a logger.warning when its docling
output is empty / under a small threshold (a likely indicator of a
scanned PDF given R5's no-OCR scope). Threshold is a module-level
Final[int] constant. JIRA #1 scope and JIRA #2 test patterns
updated to require coverage via caplog. Surfacing the silent-
degradation case in custom_processor.py logs costs nothing and
makes the OCR-needed signal visible.
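
Items 4 and 5 above can be sketched roughly as follows. This is an illustration only, not the code from rag-content: the constant name MARKDOWN_COMPATIBLE_DOC_TYPES and its type annotation come from the commit message, while the helper names (uses_markdown_parser, warn_if_probably_scanned), the threshold name MIN_EXPECTED_CHARS, and its value of 50 are assumptions made for the example.

```python
import logging
from typing import Final

logger = logging.getLogger(__name__)

# Single source of truth for doc types whose converted output is Markdown
# and can therefore share the Markdown chunking path (item 4).
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf")

# Hypothetical threshold: the commit message only says "a small threshold"
# held in a module-level Final[int] constant (item 5 / R7).
MIN_EXPECTED_CHARS: Final[int] = 50


def uses_markdown_parser(doc_type: str) -> bool:
    """Return True when doc_type should go through the Markdown chunker.

    Both call sites reference the shared constant instead of repeating
    the tuple literal.
    """
    return doc_type in MARKDOWN_COMPATIBLE_DOC_TYPES


def warn_if_probably_scanned(markdown_text: str, source: str) -> bool:
    """Log a warning when converted text is empty or suspiciously short.

    With OCR out of scope, near-empty output is a likely sign of a
    scanned PDF. Returns True when the warning was emitted.
    """
    char_count = len(markdown_text.strip())
    if char_count < MIN_EXPECTED_CHARS:
        logger.warning(
            "PDF %s produced only %d characters of text; "
            "it may be a scanned document that needs OCR",
            source,
            char_count,
        )
        return True
    return False
```

A unit test for the warning path could call warn_if_probably_scanned("", "scan.pdf") under pytest's caplog fixture and assert that one WARNING record mentioning the file name was captured, which matches the coverage the commit message asks for in JIRA #2.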
Plus the two reviewer nits worth carrying into JIRA #1:
- Use docling's TableFormerMode.ACCURATE enum, not the string literal
"accurate"; both work via Pydantic coercion but the enum is
type-checked.
- Mirror HTMLReader's choice on whether to call super().__init__();
llama-index's BaseReader does not require it but symmetry between
the two readers is preferred.
The spec doc changelog records this revision and its trigger (the
PR #1598 CodeRabbit review).
No description provided.