danielrosehill/document-to-markdown-plugin

Document-To-Markdown

Convert PDFs to clean Markdown, chunk into logical sections, and extract embedded tables to CSV.

Skills

  • setup — provision the local extractor venv and verify system tools (pdftotext, ocrmypdf, tesseract).
  • pdf-to-markdown — convert a single PDF to Markdown, picking marker / docling / pymupdf4llm based on layout complexity.
  • ocr-scanned-pdf — run ocrmypdf to add a text layer to scanned/image PDFs. Auto-invoked when needed.
  • chunk-markdown — split a long .md into logical chapters/sections with a TOON manifest.
  • extract-tables — pull tables from a PDF (camelot/tabula) into CSV files with a TOON index.
  • doc-to-everything — end-to-end orchestrator: PDF → Markdown → chunks → tables in a self-contained workspace.
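As a rough illustration of what the chunk-markdown step involves, here is a minimal sketch that splits Markdown at top-level headings and numbers the resulting files. It is not the plugin's implementation: the function names `slugify` and `chunk_markdown`, the choice of heading level, and the numbering rule are all assumptions made for this example, and the TOON manifest is left out.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"


def chunk_markdown(text: str, level: int = 1) -> list[tuple[str, str]]:
    """Split Markdown at ATX headings of the given level (hypothetical helper).

    Returns (filename, body) pairs. Text before the first heading becomes
    a "frontmatter" chunk; deeper headings stay inside their parent chunk.
    """
    heading = re.compile(rf"^{'#' * level} +(.+?)\s*$")
    sections = []
    title, lines = "frontmatter", []
    for line in text.splitlines():
        match = heading.match(line)
        if match:
            # A new heading closes the section collected so far.
            sections.append((title, lines))
            title, lines = match.group(1), [line]
        else:
            lines.append(line)
    sections.append((title, lines))

    chunks = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue  # skip an empty frontmatter section
        chunks.append((f"{len(chunks):02d}-{slugify(title)}.md", body + "\n"))
    return chunks
```

This naive version does not track fenced code blocks, so a line starting with `# ` inside a fence would be treated as a heading; a real splitter would need to handle that case.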

Output layout

Running doc-to-everything on book.pdf produces:

book/
  source.pdf
  full.md
  assets/
  chunks/
    index.toon
    00-frontmatter.md
    01-introduction.md
    ...
  tables/
    index.toon
    01-p12-revenue.csv
    ...
  manifest.toon
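The layout above can be sketched as a small helper that prepares the empty workspace for a given PDF. This is only an illustration under stated assumptions: `init_workspace` is a hypothetical name, the workspace is assumed to be named after the PDF's stem, and the `.md` and `.toon` files are not created here because their contents come from the later skills.

```python
import shutil
from pathlib import Path


def init_workspace(pdf_path: str, out_root: str = ".") -> Path:
    """Create <out_root>/<pdf stem>/ with the subfolders shown above (hypothetical helper)."""
    pdf = Path(pdf_path)
    workspace = Path(out_root) / pdf.stem
    for sub in ("assets", "chunks", "tables"):
        (workspace / sub).mkdir(parents=True, exist_ok=True)
    # Keep a copy of the input next to its outputs so the folder is self-contained.
    shutil.copyfile(pdf, workspace / "source.pdf")
    return workspace
```

Calling `init_workspace("book.pdf")` would create `book/` containing `source.pdf` and the three empty subfolders.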

Installation

claude plugins install document-to-markdown@danielrosehill

Dependencies

  • System: pdftotext (poppler-utils), ocrmypdf, tesseract-ocr.
  • Python (managed via a uv venv under $CLAUDE_USER_DATA/document-to-markdown/venv/): marker-pdf, docling, pymupdf4llm, camelot-py[cv], tabula-py, pandas.

Run the setup skill on first use.
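To check by hand whether the system tools are available before running setup, a short script along these lines may help. It is a sketch, not the setup skill itself: `missing_tools` and `REQUIRED_TOOLS` are names invented for this example, and the check only confirms that each executable is on `PATH`, not that its version is suitable.

```python
import shutil

# Executable names taken from the dependency list; tesseract-ocr installs `tesseract`.
REQUIRED_TOOLS = ("pdftotext", "ocrmypdf", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```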
