openalexPro

Tools for working with OpenAlex data, through both the API and the bulk snapshot

An R ecosystem for large-scale, on-disk access to OpenAlex — the open catalogue of global scholarly work.

OpenAlex provides free, comprehensive metadata on over 250 million scholarly works, together with their authors, institutions, and concepts. The openalexPro ecosystem is built around a single design principle: process data on disk rather than in memory, so that workflows scale to millions of records without hitting RAM limits.

Packages

  • openalexPro: core API client. Queries OpenAlex, pages through results, and stores everything in Parquet files for efficient downstream use.
  • openalexSnowball: snowball citation searches. Iteratively expands a seed set by following forward and backward citations across the graph.
  • openalexConvert: exports a Parquet corpus to BibTeX, BibLaTeX, CSL JSON, Markdown, LaTeX, HTML, or PDF via Pandoc.
  • openalexSnapshot: bulk snapshot tools. Converts the full OpenAlex JSON.GZ snapshot to Parquet, builds ID-lookup indexes, and extracts records at scale using a Rust back-end.
  • openalexVectorComp: text embedding, cosine-distance scoring, and threshold calibration. Backend-neutral (HuggingFace, OpenAI, TEI).

Rust core

  • openalex-snapshot: compiled Rust CLI and library powering openalexSnapshot's hot path (JSON→Parquet conversion, indexing, ID extraction). It is downloaded automatically as a pre-built static library on install, so most users need no manual Rust setup. Releases are published on GitHub.

Installation

All R packages are available from the openalexPro r-universe:

install.packages(
  c("openalexPro", "openalexSnowball", "openalexConvert",
    "openalexSnapshot", "openalexVectorComp"),
  repos = c("https://openalexpro.r-universe.dev", "https://cloud.r-project.org")
)

Typical workflow

library(openalexPro)

# 1. Query OpenAlex and store results as Parquet
fetch_works(
  query = '"biodiversity" AND "ecosystem services"',
  output = "my_corpus"
)

# 2. Or work directly with the bulk OpenAlex snapshot
library(openalexSnapshot)
oa_snapshot_to_parquet(
  snapshot_dir = "/data/openalex-snapshot",
  parquet_dir  = "/data/openalex-parquet"
)

# 3. Snowball-expand a seed set (optional)
library(openalexSnowball)
snowball(corpus = "my_corpus", depth = 1, output = "my_corpus_expanded")

# 4. Export to BibTeX, CSL JSON, etc.
library(openalexConvert)
corpus_to_csljson("my_corpus", output = "csl/")
csljson_convert_pandoc("csl/", "refs/", to = "bibtex")

Design principles

  • On-disk processing — results are paged and written to Parquet; memory use stays flat regardless of corpus size.
  • Arrow / DuckDB throughout — all data manipulation uses columnar formats; SQL queries run in-process.
  • Composable — each package has a single responsibility and speaks the same Parquet dialect, so they chain naturally.
  • Rust where it matters — the bulk snapshot converter delegates hot-path work to a compiled Rust back-end (openalex-snapshot), with a pure-R/DuckDB fallback for environments without a Rust toolchain.
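
The first two principles can be illustrated with a minimal, self-contained sketch. It does not use any openalexPro function: the make_page() helper, the file names, and the columns (id, publication_year, cited_by_count) are hypothetical stand-ins for paged API results, and the real schema written by the packages may differ. Each page is written to its own Parquet file with the arrow package, and the files are then aggregated in-process with DuckDB, which scans them itself without first loading them into an R data frame.

```r
library(arrow)
library(DBI)
library(duckdb)

# Toy corpus directory (assumption: any directory of Parquet files works the same way)
corpus_dir <- file.path(tempdir(), "toy_corpus")
dir.create(corpus_dir, showWarnings = FALSE)

# Hypothetical stand-in for one page of API results
make_page <- function(page) {
  data.frame(
    id = sprintf("W%03d", (page - 1) * 3 + 1:3),
    publication_year = c(2019L, 2020L, 2021L),
    cited_by_count = as.integer(page * c(1, 2, 3)),
    stringsAsFactors = FALSE
  )
}

# Write one Parquet file per page; only a single page is held in memory at a time
for (page in 1:4) {
  write_parquet(
    make_page(page),
    file.path(corpus_dir, sprintf("page_%04d.parquet", page))
  )
}

# Query all pages with SQL; DuckDB reads the Parquet files from disk
con <- dbConnect(duckdb::duckdb())
glob <- paste0(normalizePath(corpus_dir, winslash = "/"), "/*.parquet")
summary_by_year <- dbGetQuery(con, sprintf(
  "SELECT publication_year,
          COUNT(*)            AS n_works,
          SUM(cited_by_count) AS citations
   FROM read_parquet('%s')
   GROUP BY publication_year
   ORDER BY publication_year",
  glob
))
dbDisconnect(con, shutdown = TRUE)

print(summary_by_year)
```

Only the small aggregated result is returned to R, which is why this pattern keeps memory use roughly independent of the number of pages on disk.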

Contributing

Issues and pull requests are welcome on the individual package repositories. Please open an issue before starting large changes.

Acknowledgements

This ecosystem builds on the excellent openalexR package and the OpenAlex team's commitment to open scholarly infrastructure.

Disclaimer

The packages are provided as is. The authors are not affiliated with OpenAlex.
