Expose XetSession APIs to Python #792

Merged
seanses merged 23 commits into main from di/use-hf-xet-in-hf-xet
May 1, 2026


@seanses seanses commented Apr 9, 2026

Summary

Replaces the old upload_files / download_files / hash_files Python functions with a new object-oriented API that exposes XetSession and its child objects directly as PyO3 classes. This gives Python callers full control over session lifecycle, connection pooling, and progress reporting.

The previous module-level functions are kept under hf_xet/src/legacy/ and remain importable as from hf_xet import upload_files etc., but now emit DeprecationWarning.

New Python API

import hf_xet

# Optional: create a custom config (immutable; use .with_config() to derive updates)
config = hf_xet.XetConfig().with_config("data.max_concurrent_file_ingestion", 8)

# Create session; config is optional (defaults to XetConfig() with HF_XET_* env overrides)
session = hf_xet.XetSession(config=config)

# Upload — multiple files, bytes, and streaming within one commit
with session.new_upload_commit(
        endpoint="https://cas.xethub.hf.co",
        token="jwt", token_expiry_unix_secs=9999999999,
        token_refresh_url="https://…/xet-write-token/main",
        token_refresh_headers={"Authorization": "Bearer hf_…"},
    ) as commit:
    h1 = commit.start_upload_file("/path/to/model.bin")
    h2 = commit.start_upload_file("/path/to/tokenizer.json", sha256="f2358d9a…")
    h3 = commit.start_upload_bytes(b"...", name="config.json")

    stream = commit.start_upload_stream(name="big.bin")
    for chunk in produce_chunks():
        stream.write(chunk)
    stream.finish()  # must be called before the with-block exits
# on normal exit: wait_to_finish() is called automatically
# on exception:   abort() is called automatically

# SHA-256 sentinels (pass these while a commit is still open, e.g. inside the with-block above)
commit.start_upload_file("/path/to/model.bin", sha256=hf_xet.COMPUTE_SHA256)  # default
commit.start_upload_file("/path/to/model.bin", sha256=hf_xet.SKIP_SHA256)     # skip

# Progress callback — receives (GroupProgressReport, dict[UniqueID, ItemProgressReport])
def on_progress(group, items):
    bar.n = group.total_bytes_completed
    bar.refresh()

with session.new_upload_commit(
        token_refresh_url="https://…/xet-write-token/main",
        token_refresh_headers={"Authorization": "Bearer hf_…"},
        progress_callback=on_progress,
        progress_interval_ms=100,
    ) as commit:
    commit.start_upload_file("/path/to/model.bin")

# File download — multiple files within one group (downloads run concurrently)
file_info_a = hf_xet.XetFileInfo(hash_a, size_a)
file_info_b = hf_xet.XetFileInfo(hash_b, size_b)
with session.new_file_download_group(
        token_refresh_url="https://…/xet-read-token/main",
        token_refresh_headers={"Authorization": "Bearer hf_…"},
    ) as group:
    group.start_download_file(file_info_a, dest_path_a)
    group.start_download_file(file_info_b, dest_path_b)

# Streaming download — start/end are both optional (open-ended ranges supported)
group = session.new_download_stream_group(
    token_refresh_url="https://…/xet-read-token/main",
    token_refresh_headers={"Authorization": "Bearer hf_…"},
)
for chunk in group.download_stream(file_info):                    # whole file
    f.write(chunk)
for chunk in group.download_stream(file_info, start=1024):        # 1024 .. EOF
    f.write(chunk)
for chunk in group.download_stream(file_info, start=0, end=4096): # 0 .. 4096
    f.write(chunk)
for offset, chunk in group.download_unordered_stream(file_info):
    buf[offset:offset+len(chunk)] = chunk

# Ctrl-C handling
try:
    with session.new_upload_commit(...) as commit:
        commit.start_upload_file("model.bin")
except KeyboardInterrupt:
    session.sigint_abort()
    raise
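
The (offset, chunk) pairs yielded by download_unordered_stream arrive in completion order, so the caller has to place each chunk at its offset. As a minimal sketch of that reassembly step, the snippet below uses a stand-in generator instead of a real stream; fake_unordered_stream and reassemble are hypothetical names, and nothing here calls hf_xet.

```python
import random


def fake_unordered_stream(data: bytes, chunk_size: int, seed: int = 0):
    """Hypothetical stand-in: yield (offset, chunk) pairs in shuffled order."""
    offsets = list(range(0, len(data), chunk_size))
    random.Random(seed).shuffle(offsets)
    for offset in offsets:
        yield offset, data[offset:offset + chunk_size]


def reassemble(total_size: int, pairs) -> bytes:
    """Place each chunk at its offset in a preallocated buffer."""
    buf = bytearray(total_size)
    written = 0
    for offset, chunk in pairs:
        buf[offset:offset + len(chunk)] = chunk
        written += len(chunk)
    if written != total_size:
        raise ValueError(f"expected {total_size} bytes, got {written}")
    return bytes(buf)


payload = bytes(range(256)) * 4
restored = reassemble(len(payload), fake_unordered_stream(payload, chunk_size=100))
```

Preallocating the buffer requires knowing the total size up front; in the example above that would presumably come from the file info, which is an assumption of this sketch.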

Files Changed

New files

  • hf_xet/src/py_xet_session.rs — XetSession PyO3 class
  • hf_xet/src/py_upload_commit.rs — XetUploadCommit, SHA-256 sentinels (COMPUTE_SHA256, SKIP_SHA256), report types
  • hf_xet/src/py_file_upload_handle.rs — XetFileUpload
  • hf_xet/src/py_stream_upload_handle.rs — XetStreamUpload
  • hf_xet/src/py_file_download_group.rs — XetFileDownloadGroup
  • hf_xet/src/py_file_download_handle.rs — XetFileDownload
  • hf_xet/src/py_download_stream_group.rs — XetDownloadStreamGroup
  • hf_xet/src/py_download_stream_handle.rs — XetDownloadStream, XetUnorderedDownloadStream
  • hf_xet/src/config.rs — XetConfig Python class (immutable, with_config(), get(), items(), keys(), __getitem__)
  • hf_xet/src/headers.rs — build_headers_with_user_agent helper
  • hf_xet/src/legacy/mod.rs — re-exports all legacy symbols
  • hf_xet/src/legacy/types.rs — PyXetDownloadInfo, PyXetUploadInfo, PyPointerFile
  • hf_xet/src/legacy/functions.rs — deprecated upload_bytes, upload_files, download_files, force_sigint_shutdown; hash_files retained without deprecation
  • hf_xet/src/legacy/progress_update.rs — PyItemProgressUpdate, PyTotalProgressUpdate, WrappedProgressUpdater
  • hf_xet/src/legacy/runtime.rs — async runtime + SIGINT handler (used by legacy functions)
  • hf_xet/src/legacy/token_refresh.rs — Python callback token refresher (used by legacy functions)
  • hf_xet/tests/conftest.py — shared fixtures and upload helpers
  • hf_xet/tests/test_upload_commit.py — upload tests (file, bytes, stream, SHA-256 policy, progress, abort)
  • hf_xet/tests/test_file_download.py — file download tests (handles, round-trips, progress, cancel)
  • hf_xet/tests/test_stream_download.py — ordered and unordered streaming download tests with range variants
  • hf_xet/tests/test_progress.py — progress callback argument types and field verification
  • hf_xet/tests/test_session.py — XetSession lifecycle and group/commit creation tests
  • hf_xet/tests/test_config.py — XetConfig construction, with_config, get, items, keys

Modified files

  • hf_xet/src/lib.rs — module declarations; XetTaskState Python enum; blocking_call_with_signal_check utility; legacy module registered at top level for backward compatibility
  • hf_xet/src/logging.rs — calls xet_pkg::init_logging() instead of xet_runtime directly
  • hf_xet/Cargo.toml — added xet-runtime, xet-client deps (for legacy module); feature flags route through xet-pkg
  • xet_pkg/src/xet_session/file_download_group.rs — exposes XetDownloadGroupReport as a Python class (pyclass(get_all), __repr__)
  • xet_data/src/processing/xet_file.rs — #[new] Python constructor for XetFileInfo
  • xet_pkg/Cargo.toml — added no-default-cache, tokio-console, elevated_information_level features
  • xet_pkg/src/lib.rs — added init_logging() wrapper
  • xet_runtime/src/core/runtime.rs — fork-safe Drop: detect child process via stored PID, discard runtime instead of blocking shutdown
  • .github/workflows/ci.yml — added Python integration test step (maturin + pytest) to Linux, Windows, macOS jobs

Test Plan

Design Notes

  • API style: groups and commits are created with keyword arguments directly on the factory method (session.new_upload_commit(endpoint=..., token_refresh_url=..., progress_callback=...)) rather than a builder chain. The Rust builder is used internally and never surfaces in the Python API.

  • Token refresh: the old API required Python to pass a token-refresh callable that Rust invoked across the GIL boundary. The new API uses token_refresh_url + token_refresh_headers — Rust refreshes autonomously via HTTP, removing GIL re-entry on the hot path. WrappedTokenRefresher is kept only in legacy/.

  • SHA-256 policy: start_upload_file, start_upload_bytes, and start_upload_stream accept a sha256 argument that is either a pre-computed hex string, COMPUTE_SHA256 (default), or SKIP_SHA256. Sentinel objects have descriptive __repr__ for debugging.

  • XetConfig: immutable Python class wrapping the Rust XetConfig. with_config(name, value) or with_config({...}) returns a new config with updates applied. Supports dotted-path access (config["data.max_concurrent_file_ingestion"], config.get("data.max_concurrent_file_ingestion")), items(), keys(), and iteration.

  • XetTaskState: a Python class with variants Running, Finalizing, Completed, UserCancelled. The Error variant of the internal Rust enum surfaces as a raised Python exception rather than an enum value (PyO3 0.26 does not support mixed unit/complex enum variants in #[pyclass]).

  • Progress callbacks: progress_callback spawns a background thread that delivers (GroupProgressReport, dict[UniqueID, ItemProgressReport]) to the Python callable every progress_interval_ms milliseconds (default 100). The same signature covers both upload and download groups.

  • GIL release and Ctrl-C: queue operations (start_upload_file, start_upload_bytes, start_download_file) use py.detach() and return quickly. Long-wait operations (wait_to_finish()) run the blocking call on a background thread while the calling thread releases the GIL for 100 ms windows and polls py.check_signals() — Ctrl-C raises KeyboardInterrupt within one interval without starving other Python threads. XetError::KeyboardInterrupt maps to PyKeyboardInterrupt. The recommended caller pattern is except KeyboardInterrupt: session.sigint_abort(); raise, which is idiomatic Python: sigint_abort() flags the runtime so the background thread exits cleanly at its next checkpoint.

  • Context managers and concurrency: XetUploadCommit and XetFileDownloadGroup implement __enter__/__exit__; __exit__ delegates to wait_to_finish() on success and abort() on exception. Multiple start_upload_file / start_download_file calls within a with block run concurrently; the block exit waits for all to complete. XetDownloadStreamGroup is not a context manager — streams are opened with download_stream() / download_unordered_stream() and iterated directly.

  • Streaming uploads: commit.start_upload_stream() returns a XetStreamUpload handle for incremental writes (.write(bytes), then .finish() before wait_to_finish()).

  • Streaming downloads: download_stream and download_unordered_stream accept optional start / end byte offsets; either may be omitted independently. download_unordered_stream yields (offset, bytes) tuples in completion order.

  • Fork-safe runtime drop: XetRuntime records its creating PID. If drop fires in a child process after fork, the parent's Tokio threads do not exist in the child, so shutdown_timeout() would block waiting on them. In that case the runtime is discarded via mem::forget instead.

  • Backward compatibility: all pre-1.x functions (upload_bytes, upload_files, hash_files, download_files, force_sigint_shutdown) and types (PyXetDownloadInfo, PyXetUploadInfo, PyPointerFile, PyItemProgressUpdate, PyTotalProgressUpdate) remain importable from the top-level hf_xet module. Deprecated functions emit DeprecationWarning at stacklevel=2. hash_files is not deprecated.
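
The SHA-256 policy note above can be illustrated with a pure-Python sketch of the sentinel pattern. This is not the PR's Rust implementation: _Sentinel, the 64-hex-character validation, the lowercasing, and the returned tuple shape are all assumptions made for the example. Only the names COMPUTE_SHA256, SKIP_SHA256 and parse_sha256 come from this PR.

```python
import re


class _Sentinel:
    """Hypothetical sentinel type with a descriptive repr."""

    def __init__(self, name: str):
        self._name = name

    def __repr__(self) -> str:
        return self._name


COMPUTE_SHA256 = _Sentinel("COMPUTE_SHA256")
SKIP_SHA256 = _Sentinel("SKIP_SHA256")

_HEX_SHA256 = re.compile(r"[0-9a-fA-F]{64}")


def parse_sha256(value=None):
    """Return ("compute" | "skip" | "provided", hex_or_None)."""
    if value is None or value is COMPUTE_SHA256:
        return ("compute", None)
    if value is SKIP_SHA256:
        return ("skip", None)
    # Assumed validation: a pre-computed digest is 64 hex characters.
    if isinstance(value, str) and _HEX_SHA256.fullmatch(value):
        return ("provided", value.lower())
    raise ValueError(f"invalid sha256 argument: {value!r}")


default_policy = parse_sha256()
skip_policy = parse_sha256(SKIP_SHA256)
given_policy = parse_sha256("AB" * 32)
```

Comparing with `is` rather than `==` is what makes sentinels safe here: no user-supplied string can be mistaken for a policy marker.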


Note

Medium Risk
Medium risk because it reworks the Python-facing API surface and adds new concurrency/progress-callback behavior around uploads/downloads, which could introduce behavioral regressions despite extensive tests.

Overview
Adds a new object-oriented Python API around XetSession (plus upload commits, file-download groups, and ordered/unordered download streams), including progress callbacks, SHA-256 policy sentinels, and signal-friendly blocking waits.

Moves the pre-1.x module-level functions/types into a legacy module, keeps them importable for compatibility, and makes the main entry points emit DeprecationWarning while sharing header/User-Agent handling and centralized error conversion.

Extends xet_pkg/xet_data with Python-facing pyclass report/value types and minor API tweaks (e.g., optional stream progress, cloneable handles, active task info), bumps hf_xet to 1.5.0, and updates CI to build the extension via maturin and run new pytest integration tests on Linux/Windows/macOS.

Reviewed by Cursor Bugbot for commit dc478ab.

@seanses seanses force-pushed the di/use-hf-xet-in-hf-xet branch from c65cf6e to b3bb567 Compare April 10, 2026 15:25
@seanses seanses force-pushed the di/use-hf-xet-in-hf-xet branch from 85873ea to 1d81a43 Compare April 15, 2026 14:36
@seanses seanses force-pushed the di/use-hf-xet-in-hf-xet branch from 5f9b2ef to ed125ba Compare April 16, 2026 03:42
@seanses seanses marked this pull request as ready for review April 16, 2026 06:29
Comment thread hf_xet/src/lib.rs
Comment thread hf_xet/src/py_upload_commit.rs
Comment thread .github/workflows/ci.yml
Comment thread hf_xet/src/lib.rs Outdated

@hoytak hoytak left a comment

Largely looks good to me, especially as a first pass. I think there are some things that we could tighten up:

  • I think we should drop the builder pattern, as Python supports optional keyword parameters. The API for creating things could replace the current with_XXX calls in the builder struct with optional keyword arguments to the main constructor. E.g. session.new_download_group().with_endpoint("my endpoint").build() becomes session.new_download_group(endpoint = "my endpoint") instead, with all the arguments being optional.

  • The repr methods are really minimal, especially with download and progress reporting; normally str is meant to be the quick readable version and repr is meant to give a detailed and more complete representation. I think we should have both, filling in more fields in the repr stuff.

  • The process changing on PID stuff can be tightened up. Now that it's session based, I think the best way to do this is just raise a RuntimeError inside all of the XetRuntime bridge methods if the pid changed. This can be handled just in XetRuntime and that should be sufficient for all our needs.
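
The PID suggestion in the last bullet can be sketched in Python as follows. ForkGuard and FakeRuntime are hypothetical stand-ins (the real check would live in the Rust XetRuntime), and the fork is simulated by overwriting the stored PID rather than actually forking.

```python
import os


class ForkGuard:
    """Hypothetical helper: remember the creating PID and refuse use after fork."""

    def __init__(self):
        self._creator_pid = os.getpid()

    def check(self):
        if os.getpid() != self._creator_pid:
            raise RuntimeError(
                f"runtime created in PID {self._creator_pid} "
                f"used from PID {os.getpid()}"
            )


class FakeRuntime:
    """Stand-in for a runtime whose bridge methods all call the guard first."""

    def __init__(self):
        self._guard = ForkGuard()

    def bridge_call(self, fn, *args):
        self._guard.check()
        return fn(*args)


runtime = FakeRuntime()
same_process_result = runtime.bridge_call(sum, [1, 2, 3])

# Simulate a fork by pretending the runtime was created in another process.
runtime._guard._creator_pid = os.getpid() + 1
try:
    runtime.bridge_call(sum, [1, 2, 3])
    raised = False
except RuntimeError:
    raised = True
```

Putting the check in one bridge method keeps the policy in a single place, which matches the suggestion that handling it in the runtime alone should be sufficient.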


@rajatarya rajatarya left a comment

Really nice to see the object-oriented API land — the with session.new_upload_commit()...build() as commit: flow is a big ergonomic upgrade over the old flat functions, and the split between XetUploadCommit/XetFileDownloadGroup/XetDownloadStreamGroup maps cleanly onto how callers actually think about batching. The PyO3 class layering, legacy module shim with DeprecationWarning, and the fork-safe XetRuntime::drop fix are all solid. Tests look thorough against the local:// endpoint. I won't re-litigate things hoytak (builder-pattern, __repr__ richness, PID handling in XetRuntime rather than per-session) or the cursor bot (fd limit removal, stream-upload progress visibility, panic propagation in blocking_call_with_signal_check) already flagged — I agree with those.

A handful of things I'd like to see addressed or discussed before this merges:

Things to address

  1. upload_stream doesn't release the GIL (py_upload_commit.rs:326). upload_file and upload_bytes both wrap *_blocking in py.detach(); upload_stream calls self.inner.upload_stream_blocking(...) directly. If this method does any I/O or awaits a queue slot, it'll block the interpreter for the duration. Either make it consistent with siblings or document why streaming setup is cheap enough to skip.

  2. Progress callback misses the terminal tick (py_upload_commit.rs:187-193, py_file_download_group.rs:175-181). Once is_terminal is true, the loop breaks without delivering one last progress snapshot. Callers that drive a progress bar (the docstring example with tqdm) will freeze at whatever sub-terminal value the previous tick saw, not at 100%. Consider calling callback.call1(py, ...) one final time with the terminal state before breaking.

  3. Streaming download __next__ holds the GIL while waiting (py_download_stream_handle.rs:47, 105). The inline comment is honest about it, but this has two real consequences worth reconsidering: (a) the progress thread registered on a XetUploadCommit / XetFileDownloadGroup can't run if another Python thread is looping on an iterator from the same session (common in a ThreadPoolExecutor-style consumer); (b) Ctrl-C won't be observed until the next chunk arrives, so a stalled download becomes uninterruptible from Python. XetDownloadStream not being Clone is a xet_pkg thing — could the handle be made clonable (or expose a blocking_next that takes &Arc<Self>) so py.detach() is possible? This is the one part of the API where a large download hang will feel non-Pythonic.

  4. status() returns &'static str (py_xet_session.rs:77, and the same pattern on every handle/commit/group). This is a public API. Stringly-typed states are hard to match on from Python (if session.status() == "Running" with no autocomplete, no typos caught), and the Error variant's message is silently lifted into an exception — callers that want to inspect state without raising can't. I'd really like this to be either a PyO3 enum or a small XetTaskState class with factory constants. This felt like a public-API issue so not marking as a nit.

  5. __exit__ silently drops abort() errors on the exception path (py_upload_commit.rs:250, py_file_download_group.rs:237). If abort() itself errors while an exception is already in flight, we swallow it with let _ =. If abort-on-exception ever fails for a real reason (e.g. runtime already shut down), the original exception surfaces fine but we lose the observability. A tracing::warn! around the discarded error would at least leave a breadcrumb.

  6. XetSession::abort() vs sigint_abort() aren't distinguished clearly (py_xet_session.rs:84, 89). The docstrings say "Cancel all active operations on this session. The session remains usable after abort" vs "SIGINT-style abort: shuts down the runtime and cancels all tasks." It's not obvious from Python when to reach for which — worth a sentence on each explicitly contrasting the other ("unlike sigint_abort, the session remains usable" / "destroys the runtime; session is unusable afterwards").
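
Item 2 (the missing terminal tick) comes down to the order of two steps in the reporting loop. A minimal Python sketch of the suggested ordering, using a hypothetical FakeTask in place of a real commit or group, might look like this:

```python
import threading
import time


class FakeTask:
    """Hypothetical stand-in for a commit/group that reports progress."""

    def __init__(self, total: int):
        self.total = total
        self.done = 0
        self._lock = threading.Lock()

    def advance(self, n: int):
        with self._lock:
            self.done = min(self.total, self.done + n)

    def snapshot(self):
        with self._lock:
            return self.done, self.done >= self.total  # (bytes, is_terminal)


def progress_loop(task: FakeTask, callback, interval_s: float):
    while True:
        done, is_terminal = task.snapshot()
        callback(done)       # deliver the snapshot first...
        if is_terminal:      # ...then stop, so the final value is reported
            break
        time.sleep(interval_s)


seen = []
task = FakeTask(total=100)
reporter = threading.Thread(target=progress_loop, args=(task, seen.append, 0.01))
reporter.start()
for _ in range(4):
    task.advance(25)
    time.sleep(0.02)
reporter.join()
```

Checking is_terminal before invoking the callback would reproduce the reported behaviour: the loop would exit without ever delivering the 100% snapshot.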

Questions

  • with_token_refresh_url(url, headers) bakes in the {accessToken, exp, casUrl} response shape. That's fine for the Hub, but hf_xet's Python surface is becoming a public-ish API (pipe for huggingface_hub, but also surfaced via import hf_xet). Worth renaming to something Hub-flavoured (with_hf_token_refresh_url?) or documenting that the JSON schema is part of the contract?
  • XetSession() has no constructor arguments. Is there a design intent that all tunables live on the commit/group builder, or is session-level config (cache dir, concurrency limits) just not plumbed yet?
  • hash_files is the only legacy function kept without DeprecationWarning. Is that because huggingface_hub still needs it, or does it have a place in the new surface too?
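
To make the first question concrete, here is a hedged sketch of what a client relying on the {accessToken, exp, casUrl} response shape might validate. Only the three key names come from this thread; RefreshedToken, parse_refresh_response, the field types, and the reading of exp as Unix seconds are assumptions of the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshedToken:
    """Hypothetical container for one token-refresh response."""
    access_token: str
    expiry_unix_secs: int
    cas_url: str


def parse_refresh_response(body: str) -> RefreshedToken:
    """Validate the three keys named in the review and build a RefreshedToken."""
    payload = json.loads(body)
    missing = [k for k in ("accessToken", "exp", "casUrl") if k not in payload]
    if missing:
        raise ValueError(f"token refresh response is missing keys: {missing}")
    return RefreshedToken(
        access_token=str(payload["accessToken"]),
        expiry_unix_secs=int(payload["exp"]),  # assumed to be Unix seconds
        cas_url=str(payload["casUrl"]),
    )


sample = '{"accessToken": "jwt", "exp": 9999999999, "casUrl": "https://cas.xethub.hf.co"}'
token = parse_refresh_response(sample)
```

Any server other than the Hub would have to return this same shape for the refresh URL to work, which is why the question of documenting the schema as part of the contract matters.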

Overall this is a good shape for a 1.x story — once the above are resolved I think this is ready to ship.

Comment thread hf_xet/src/py_upload_commit.rs Outdated
Comment thread hf_xet/src/py_upload_commit.rs Outdated
Comment thread hf_xet/src/py_upload_commit.rs Outdated
Comment thread hf_xet/src/py_download_stream_handle.rs
Comment thread hf_xet/src/py_xet_session.rs Outdated
Comment thread hf_xet/src/py_xet_session.rs
Comment thread hf_xet/src/py_upload_commit.rs Outdated

Wauplin commented Apr 23, 2026

Thanks for exposing the XetSession! This is awesome and will likely unlock very nice use cases (I especially look forward to the "upload bytes from stream" feature).

Small nit regarding the interface: would it be possible to have a context manager for the upload stream? I.e. instead of (or in addition to)

    stream = commit.upload_stream(name="big.bin")
    for chunk in produce_chunks():
        stream.write(chunk)
    stream.finish()  # must be called before the with-block exits

be able to do

    with commit.upload_stream(name="big.bin") as stream:
        for chunk in produce_chunks():
            stream.write(chunk)

=> no need for explicit stream.finish() call
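
A rough sketch of how such a context manager could behave, written against a hypothetical FakeStreamUpload rather than the real handle. Aborting on the exception path is an assumed policy, not something this comment specifies.

```python
class FakeStreamUpload:
    """Hypothetical stand-in for a stream upload handle with write()/finish()."""

    def __init__(self):
        self.chunks = []
        self.finished = False
        self.aborted = False

    def write(self, data: bytes):
        if self.finished or self.aborted:
            raise ValueError("stream is closed")
        self.chunks.append(bytes(data))

    def finish(self):
        self.finished = True

    def abort(self):
        self.aborted = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Finish on a clean exit; on error, abort instead (an assumed policy).
        if exc_type is None:
            self.finish()
        else:
            self.abort()
        return False  # never suppress the caller's exception


with FakeStreamUpload() as stream:
    for chunk in (b"ab", b"cd"):
        stream.write(chunk)
```

Returning False from __exit__ lets an exception raised inside the block propagate to the enclosing commit, which can then run its own abort path.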


Wauplin commented Apr 23, 2026

Made some other comments in huggingface/huggingface_hub#4116 :)

Comment thread hf_xet/src/py_upload_commit.rs
Comment thread hf_xet/src/py_upload_commit.rs
Comment thread hf_xet/tests/conftest.py

seanses commented Apr 30, 2026

> with_token_refresh_url(url, headers) bakes in the {accessToken, exp, casUrl} response shape. That's fine for the Hub, but hf_xet's Python surface is becoming a public-ish API (pipe for huggingface_hub, but also surfaced via import hf_xet). Worth renaming to something Hub-flavoured (with_hf_token_refresh_url?) or documenting that the JSON schema is part of the contract?

Documented that the JSON schema is part of the contract.

> XetSession() has no constructor arguments. Is there a design intent that all tunables live on the commit/group builder, or is session-level config (cache dir, concurrency limits) just not plumbed yet?

Session-level config was plumbed through in commit 0676fd4.

> hash_files is the only legacy function kept without DeprecationWarning. Is that because huggingface_hub still needs it, or does it have a place in the new surface too?

hash_files involves only CPU work and local disk reads, so there is no plan to move it into the new interface yet. It will stay where it is until there is a need.

Comment thread hf_xet/src/py_upload_commit.rs

seanses commented Apr 30, 2026

> Small nit regarding the interface, would it be possible to have a context manager for the upload stream?

Added in #825


@rajatarya rajatarya left a comment

Thanks for the careful pass — looking through the changes since 5d3aa3c6, every actionable item from my prior review is addressed:

| Prior item | Resolution |
| --- | --- |
| #1 upload_stream not releasing GIL | Now wrapped in py.detach(...) (py_upload_commit.rs:298), consistent with upload_file / upload_bytes. |
| #2 Missing terminal-tick progress callback | Progress thread now checks is_terminal after firing the callback for the snapshot, so the final 100% tick lands (py_upload_commit.rs:117-130). |
| #3 __next__ holds GIL on stream download | Acknowledged and deferred to a follow-up as an optimization. The constraint (XetDownloadStream not being Clone in xet_pkg) is real, so this is reasonable to land separately. |
| #4 status() returning &'static str | Replaced with PyXetTaskState enum (lib.rs:46-53). PyO3's mixed-variant restriction is documented inline, and the Error variant is correctly surfaced as a raised exception via task_state_to_pystate. |
| #5 __exit__ swallowing abort() errors | tracing::warn! added on the exception path (py_upload_commit.rs:208-210). |
| #6 abort() vs sigint_abort() confusion | Resolved by removing XetSession::abort() from the Python surface entirely; sigint_abort docstring now contrasts with the per-handle abort (py_xet_session.rs:269-280). |
| nit UniqueID/UniqueId mismatch | Split into #824. |
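
The resolution of item #4 (four unit states exposed as an enum, with the error case raised as an exception) can be approximated in pure Python. TaskState and to_python_state are hypothetical names, the (kind, message) input is an invented stand-in for the internal Rust enum, and RuntimeError is an assumed exception type.

```python
from enum import Enum


class TaskState(Enum):
    """Python-side states; there is deliberately no ERROR member."""
    RUNNING = "Running"
    FINALIZING = "Finalizing"
    COMPLETED = "Completed"
    USER_CANCELLED = "UserCancelled"


def to_python_state(kind: str, message: str = "") -> TaskState:
    """Map an internal (kind, message) pair to TaskState, raising on errors."""
    if kind == "Error":
        raise RuntimeError(message or "task failed")
    return TaskState(kind)


state = to_python_state("Completed")
try:
    to_python_state("Error", "upload failed")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Callers can then compare with `state is TaskState.COMPLETED` instead of matching on strings, which addresses the typo and autocomplete concerns raised in the first review.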

A few additional things I particularly liked in this iteration:

  • PyXetConfig API — the with_config(name, value) / with_config({...}) overload, dict-like __getitem__ / keys() / items() / iteration, and the immutable-update-returning-new-instance pattern is exactly the right Python ergonomics. Nicely isolated in hf_xet/src/config.rs so XetConfig's Rust shape can evolve without churning the Python surface.
  • Sha256 sentinels (COMPUTE_SHA256 / SKIP_SHA256 as zero-data #[pyclass(frozen)] types + module-level constants) is much more Pythonic than the prior policy enum, and parse_sha256 accepts None | sentinel | str cleanly.
  • Rename to XetSession.new_upload_commit / new_file_download_group / new_download_stream_group reads better than the older builder-pattern naming.
  • __repr__ on PyXetSession including config dump is useful for debugging — and the test suite's test_session.py / test_config.py give good shape coverage.
  • The legacy module shim with DeprecationWarning keeps huggingface_hub's pinned >=1.4.3,<2.0.0 range valid, which was the right call coming out of the Slack discussion.
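
The immutable-update pattern praised in the first bullet can be sketched in a few lines of Python. FrozenConfig is a hypothetical stand-in, not the PyXetConfig implementation, and the starting value of 4 is invented for the example.

```python
from types import MappingProxyType


class FrozenConfig:
    """Hypothetical immutable config keyed by dotted paths."""

    def __init__(self, values=None):
        # A read-only view over a private copy prevents in-place mutation.
        self._values = MappingProxyType(dict(values or {}))

    def with_config(self, name_or_updates, value=None):
        """Return a new config; accepts (name, value) or a dict of updates."""
        if isinstance(name_or_updates, dict):
            updates = name_or_updates
        else:
            updates = {name_or_updates: value}
        return FrozenConfig({**self._values, **updates})

    def get(self, name, default=None):
        return self._values.get(name, default)

    def __getitem__(self, name):
        return self._values[name]

    def keys(self):
        return self._values.keys()

    def items(self):
        return self._values.items()

    def __iter__(self):
        return iter(self._values)


base = FrozenConfig({"data.max_concurrent_file_ingestion": 4})
tuned = base.with_config("data.max_concurrent_file_ingestion", 8)
```

Because with_config never touches the receiver, a config shared between sessions cannot change underneath one of them.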

The doc-string :meth:/:class: cross-refs are consistent throughout, the kwargs-only signatures (#[pyo3(signature = (...))]) match the Python convention, and the GIL discipline (everywhere except the deferred download-stream __next__) is now uniform.
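
For readers unfamiliar with the GIL discipline mentioned here, the design notes describe long waits as a background thread doing the blocking work while the caller polls for signals in 100 ms windows. A pure-Python analogue of that idea, which is an assumption-laden sketch and not the PR's blocking_call_with_signal_check, could be:

```python
import threading
import time


def wait_with_signal_checks(blocking_fn, poll_interval_s: float = 0.1):
    """Run blocking_fn on a worker thread and poll for its completion.

    The main thread wakes up every poll_interval_s, which gives the
    interpreter a chance to raise KeyboardInterrupt instead of sitting
    in one long blocking call.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = blocking_fn()
        except BaseException as exc:  # hand errors back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        thread.join(timeout=poll_interval_s)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def slow_task():
    time.sleep(0.3)
    return "done"


result = wait_with_signal_checks(slow_task)
```

The worker is a daemon thread so that an interrupted caller is not kept alive by it; a real implementation would also need a way to tell the worker to stop, which is the role sigint_abort() plays in the recommended pattern.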

This is a big surface and a clean landing. Ship it after #824 and #825 stack in. 🚀


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit dc478ab.

Comment thread hf_xet/src/py_upload_commit.rs
@seanses seanses merged commit 23ec294 into main May 1, 2026
10 checks passed
@seanses seanses deleted the di/use-hf-xet-in-hf-xet branch May 1, 2026 10:05
seanses added a commit that referenced this pull request May 1, 2026
In response to #792 (comment), this removes the `UniqueID` type alias that re-exported `xet_runtime::utils::UniqueId` under a screaming-snake-case name from `xet_data::progress_tracking`. This type alias is unnecessary and caused confusion for reviewers (both human beings and agents).
seanses added a commit that referenced this pull request May 2, 2026
Addresses the review comment (#792 (comment)) on #792: adds a context manager to `XetStreamUpload` so callers no longer need an explicit `finish()` call.
4 participants