toapi

Turn any website into a JSON API — declaratively.

toapi lets you point at a web page, declare the fields you want with CSS selectors, and get back a clean JSON API. There is no crawler to babysit and no database to maintain: pages are fetched and parsed on demand, with built-in caching.

Install

pip install toapi

Requires Python 3.10+.

Quickstart

from htmlparsing import Attr, Text
from toapi import Api, Item

api = Api()


@api.site("https://news.ycombinator.com")
@api.list(".athing")
@api.route("/posts", "/news")
@api.route("/posts?page={page}", "/news?p={page}")
class Post(Item):
    title = Text(".titleline > a")
    url = Attr(".titleline > a", "href")


api.run(host="127.0.0.1", port=5000)

Save the code above as app.py and run it:

python app.py

Then visit http://127.0.0.1:5000/posts. The response looks like this (the actual headlines will vary):

{
  "Post": [
    {"title": "Mathematicians Crack the Cursed Curve", "url": "https://www.quantamagazine.org/..."},
    {"title": "Stuffing a Tesla Drivetrain into a 1981 Honda Accord", "url": "https://jalopnik.com/..."}
  ]
}
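
As a rough illustration of consuming this endpoint from another program, the sketch below uses only the standard library. The helper names fetch_posts and titles are hypothetical and not part of toapi. It assumes the Quickstart server is running locally and returns the {"Post": [...]} shape shown above.

```python
import json
from urllib.request import urlopen


def fetch_posts(base_url: str = "http://127.0.0.1:5000") -> dict:
    """GET /posts from a running toapi server and decode the JSON body."""
    with urlopen(f"{base_url}/posts", timeout=10) as resp:
        return json.load(resp)


def titles(payload: dict) -> list:
    """Return the title of every item under the "Post" key (empty if absent)."""
    return [post["title"] for post in payload.get("Post", [])]
```

With the server running, titles(fetch_posts()) would give the list of headlines.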

How it works

   ┌────────────┐    ┌────────────┐    ┌────────────┐
   │  /posts    │ ─▶ │  fetch     │ ─▶ │  parse     │ ─▶  JSON
   │  (route)   │    │  (cache)   │    │  (Item)    │
   └────────────┘    └────────────┘    └────────────┘

Route — @api.route("/posts", "/news") maps your API path to a source URL.
Fetch — pages are fetched with requests (or a headless browser if you pass browser=) and cached in memory.
Parse — each Item extracts fields with CSS selectors via htmlparsing.
Serve — Flask returns the result as JSON; subsequent calls hit the cache.
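
The fetch-and-cache step can be pictured with a few lines of plain Python. This is a minimal sketch of the idea, not toapi's actual implementation: CachedFetcher, its fetch_fn parameter, and fake_fetch are hypothetical names. A real fetcher would call requests or a browser instead of the offline stand-in used here.

```python
from typing import Callable, Dict


class CachedFetcher:
    """Fetch each source URL at most once and reuse the result afterwards."""

    def __init__(self, fetch_fn: Callable[[str], str]) -> None:
        self._fetch_fn = fetch_fn           # performs the real download
        self._cache: Dict[str, str] = {}    # source URL -> HTML text
        self.misses = 0                     # how many real downloads happened

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch_fn(url)
        return self._cache[url]


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET so the example runs offline."""
    return f"<html><body>content of {url}</body></html>"


fetcher = CachedFetcher(fake_fetch)
first = fetcher.get("https://news.ycombinator.com/news")
second = fetcher.get("https://news.ycombinator.com/news")
print(first == second, fetcher.misses)  # True 1
```

The second call is served from the dictionary, so only one download is counted.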

Features

Declarative — describe data, not scraping logic.
Routes — map clean API paths to messy source URLs with {param} placeholders.
Multi-site — merge several websites behind one API.
Cleaning hooks — define clean_<field> methods to post-process values.
Caching — pages and parsed results are cached automatically.
Headless browser — pass Api(browser="/path/to/geckodriver") for JS-heavy sites.
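
To make the {param} placeholder idea concrete, here is one way an API path could be translated into a source path. The translate function is a hypothetical helper written for illustration. toapi's own route matching may work differently.

```python
import re
from typing import Optional


def translate(path: str, api_pattern: str, source_pattern: str) -> Optional[str]:
    """Map an incoming API path to a source path, or None if it does not match."""
    # Splitting on {name} leaves literal text at even indexes
    # and placeholder names at odd indexes.
    parts = re.split(r"\{(\w+)\}", api_pattern)
    regex = "".join(
        f"(?P<{part}>[^/?&]+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    if match is None:
        return None
    # Fill the captured values into the source pattern's placeholders.
    return source_pattern.format(**match.groupdict())


print(translate("/posts?page=2", "/posts?page={page}", "/news?p={page}"))  # /news?p=2
```

A path that does not fit the API pattern returns None, which a server could turn into a 404.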

Cleaning values

Add a clean_<fieldname> method on the Item to transform a value before it's returned:

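# Continues the Quickstart: api, Item and Attr are defined there.
@api.site("https://news.ycombinator.com")
@api.route("/posts", "/news")
@api.route("/posts?page={page}", "/news?p={page}")
class Page(Item):
    next_page = Attr(".morelink", "href")

    def clean_next_page(self, value):
        # The source href ends in a page number, e.g. "?p=2". Rewrite it to
        # "/posts?page=2" so it matches the "/posts?page={page}" route.
        page = value.rsplit("=", 1)[1]
        return f"/posts?page={page}"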
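
A clean_<field> convention like this can be implemented with a simple getattr lookup. The sketch below shows the general idea under that assumption. MiniItem and apply_cleaners are hypothetical names and do not reflect toapi's internals.

```python
class MiniItem:
    """Toy base class: runs clean_<field> on each raw value if it is defined."""

    def apply_cleaners(self, raw: dict) -> dict:
        cleaned = {}
        for field, value in raw.items():
            # Look up an optional hook named clean_<field> on the instance.
            hook = getattr(self, f"clean_{field}", None)
            cleaned[field] = hook(value) if callable(hook) else value
        return cleaned


class Post(MiniItem):
    def clean_title(self, value: str) -> str:
        # Trim stray whitespace around the extracted text.
        return value.strip()


result = Post().apply_cleaners({"title": "  Hello  ", "url": "https://example.com"})
print(result)  # {'title': 'Hello', 'url': 'https://example.com'}
```

Fields without a matching hook, such as url here, pass through unchanged.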

Development

git clone https://github.com/elliotgao2/toapi.git
cd toapi
uv sync          # install deps into .venv
uv run pytest    # run tests
uv run ruff check .

We use uv for packaging and ruff for linting and formatting. Pre-commit hooks run both checks before each commit:

uv run pre-commit install

Contributing

Pull requests are welcome. For non-trivial changes, please open an issue first to discuss what you'd like to change. Before submitting, make sure the test suite (uv run pytest) and the linter (uv run ruff check .) both pass.
