GitHub templates by tisnik · Pull Request #2 · lightspeed-core/lightspeed-stack

tisnik · 2025-05-01T08:44:24Z

No description provided.

Add the spike doc (decisions up front, background below, 4 proposed JIRAs) and the spec doc (R1..R6 requirements, architecture, key files and insertion points, known limitations) under docs/design/byok-pdf/.

The spike is lightweight by design: HTML support shipped under LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern, docling dependency, BaseReader plumbing, CLI shape, and test layout are all already established. PDF support is a scaffold-and-mirror job plus a one-line addition to document_processor.py's doc_type branches.

Decisions captured for confirmation (each with options table and recommendation in the spike doc):

D1: Library -- docling (already a dependency)
D2: OCR for scanned PDFs -- out of scope; track as follow-up
D3: Repo placement -- rag-content (impl) + lightspeed-stack (BYOK guide update only)
D4: Pipeline knobs -- hard-coded sensible defaults; no CLI flags in v1 (mirrors HTMLReader)
D5: Chunking strategy -- reuse MarkdownNodeParser; add "pdf" to document_processor.py:75 and :87
D6: Code organization -- new pdf/ package mirroring html/
D7: Test coverage -- unit/integration in JIRA #2, e2e in #3

Four sub-JIRAs proposed under LCORE-1471 (parseable by dev-tools/file-jiras.sh):

1. rag-content: Implement PDF support
2. rag-content: Unit and integration tests
3. rag-content: End-to-end test (PDF -> vector store -> stack query)
4. lightspeed-stack: Update BYOK guide for native PDF support

PoC evidence under poc-results/:

01-poc-report.txt        Methodology, findings, implications
02-conversion-log.txt    Exact commands and timings
03-sample-jira-1311.md   Clean conversion (Atlassian Cloud PDF)
04-sample-jira-836.md    Body clean, headings degraded (Confluence PDF, letter-spaced display font)

Honest PoC findings worth surfacing:

- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF") produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for small/medium PDFs. Acceptable for offline indexing.

Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be removed before merge; spike doc and spec doc remain in the repo.

Apply CodeRabbit's actionable comments and the per-comment nits:

1. PoC results section in spike doc previously listed paths under poc-results/ that are deleted before merge per howto-run-a-spike.md step 10, leaving broken links in the merged document. Replace the file list with a self-contained summary of what the PoC proved plus the heading-degradation finding, and a note pointing future readers at the PR diff if the raw artifacts are ever needed.
2. Drop the reference to docs/local-stack-testing.md (a local-only file, never committed to the repo).
3. Replace fragile line-numbered references (document_processor.py:75, :87, byok_guide.md ~106-118) with stable symbol anchors: _BaseDB.__init__, _LlamaStackDB.__init__, "Knowledge Sources" subsection, "Step 1" subsection. Line numbers rot; section names and symbol names rot less.
4. Spec doc now instructs the implementation ticket to extract the ("markdown", "html", "pdf") predicate to a single MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] constant in document_processor.py and reference it from both call sites, instead of duplicating the tuple. JIRA #1 scope updated to match.
5. Add R7: PDFReader.load_data emits a logger.warning when its docling output is empty / under a small threshold (a likely indicator of a scanned PDF given R5's no-OCR scope). Threshold is a module-level Final[int] constant. JIRA #1 scope and JIRA #2 test patterns updated to require coverage via caplog. Surfacing the silent-degradation case in custom_processor.py logs costs nothing and makes the OCR-needed signal visible.

Plus the two reviewer nits worth carrying into JIRA #1:

- Use docling's TableFormerMode.ACCURATE enum, not the string literal "accurate"; both work via Pydantic coercion but the enum is type-checked.
- Mirror HTMLReader's choice on whether to call super().__init__(); llama-index's BaseReader does not require it but symmetry between the two readers is preferred.

The spec doc changelog records this revision and its trigger (the PR #1598 CodeRabbit review).
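
Items 4 and 5 of the message above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in rag-content: only the constant name MARKDOWN_COMPATIBLE_DOC_TYPES and its annotation come from the commit message. The helper names, the threshold name MIN_EXPECTED_MARKDOWN_CHARS, its value of 50, and the log wording are hypothetical, and the docling conversion itself is left out so the example stays standard-library only.

```python
import logging
from typing import Final

logger = logging.getLogger(__name__)

# Item 4: a single source of truth for doc types whose converted output is
# Markdown, so both call sites reference the constant instead of a literal tuple.
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf")

# Item 5: module-level threshold. The name and the value 50 are assumptions;
# the commit message only says "a small threshold".
MIN_EXPECTED_MARKDOWN_CHARS: Final[int] = 50


def uses_markdown_parser(doc_type: str) -> bool:
    """Return True when doc_type should share the Markdown chunking path."""
    return doc_type in MARKDOWN_COMPATIBLE_DOC_TYPES


def warn_if_probably_scanned(markdown: str, source: str) -> bool:
    """Log a warning when converted output is empty or suspiciously short.

    Hypothetical helper standing in for the check inside PDFReader.load_data.
    Returns True when a warning was emitted.
    """
    text_length = len(markdown.strip())
    if text_length < MIN_EXPECTED_MARKDOWN_CHARS:
        logger.warning(
            "PDF %s produced only %d characters of text; it may be a scanned "
            "PDF, and OCR is out of scope",
            source,
            text_length,
        )
        return True
    return False
```

In a pytest suite, the warning path would typically be asserted through the caplog fixture, as the commit message requires for JIRA #2.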

tisnik added 3 commits May 1, 2025 10:43

Bug report template

80020d0

New feature template

485b294

Pull request template

bebab6e

tisnik merged commit 94b3192 into lightspeed-core:main May 1, 2025

tisnik mentioned this pull request Sep 11, 2025

[RHDHPAI-976] Add customization profiles via path #487

Merged

18 tasks

radofuchs referenced this pull request in radofuchs/lightspeed-stack Mar 27, 2026

# This is a combination of 2 commits.

9e94bbe

# This is the 1st commit message:

LCORE-1270: Added e2e tests for responses endpoint

# This is the commit message #2:

fix

tisnik mentioned this pull request Apr 10, 2026

LCORE-1591: Observability for Lightspeed Core #1482

Open

19 tasks

max-svistunov mentioned this pull request Apr 27, 2026

LCORE-1471 spike: BYOK PDF support #1598

Merged

5 tasks

tisnik added a commit that referenced this pull request May 13, 2026

New unit test #2

06cf91f

tisnik merged 3 commits into lightspeed-core:main from tisnik:github-templates
