Conversation
…ng upload, progress interval, and integration tests
c65cf6e to
b3bb567
Compare
…, XetFileInfo Python constructor, and split tests by feature
85873ea to
1d81a43
Compare
5f9b2ef to
ed125ba
Compare
5240995 to
1604623
Compare
hoytak
left a comment
There was a problem hiding this comment.
Largely looks good to me, especially as a first pass. I think there are some things that we could tighten up:
-
I think we should drop the builder pattern as python supports optional keyword parameters, so the API for a creating things could just replace the current with_XXX calls in the builder struct with optional keyword arguments to the main constructor. E.g.
session.new_download_group().with_endpoint("my endpoint").build()becomessession.new_download_group(endpoint = "my endpoint")instead, with all the arguments being optional. -
The repr methods are really minimal, especially with download and progress reporting; normally str is meant to be the quick readable version and repr is meant to give a detailed and more complete representation. I think we should have both, filling in more fields in the repr stuff.
-
The process changing on PID stuff can be tightened up. Now that it's session based, I think the best way to do this is just raise a RuntimeError inside all of the XetRuntime bridge methods if the pid changed. This can be handled just in XetRuntime and that should be sufficient for all our needs.
rajatarya
left a comment
There was a problem hiding this comment.
Really nice to see the object-oriented API land — the with session.new_upload_commit()...build() as commit: flow is a big ergonomic upgrade over the old flat functions, and the split between XetUploadCommit/XetFileDownloadGroup/XetDownloadStreamGroup maps cleanly onto how callers actually think about batching. The PyO3 class layering, legacy module shim with DeprecationWarning, and the fork-safe XetRuntime::drop fix are all solid. Tests look thorough against the local:// endpoint. I won't re-litigate things hoytak (builder-pattern, __repr__ richness, PID handling in XetRuntime rather than per-session) or the cursor bot (fd limit removal, stream-upload progress visibility, panic propagation in blocking_call_with_signal_check) already flagged — I agree with those.
A handful of things I'd like to see addressed or discussed before this merges:
Things to address
-
upload_streamdoesn't release the GIL (py_upload_commit.rs:326).upload_fileandupload_bytesboth wrap*_blockinginpy.detach();upload_streamcallsself.inner.upload_stream_blocking(...)directly. If this method does any I/O or awaits a queue slot, it'll block the interpreter for the duration. Either make it consistent with siblings or document why streaming setup is cheap enough to skip. -
Progress callback misses the terminal tick (py_upload_commit.rs:187-193, py_file_download_group.rs:175-181). Once
is_terminalis true, the loop breaks without delivering one last progress snapshot. Callers that drive a progress bar (the docstring example with tqdm) will freeze at whatever sub-terminal value the previous tick saw, not at 100%. Consider callingcallback.call1(py, ...)one final time with the terminal state before breaking. -
Streaming download
__next__holds the GIL while waiting (py_download_stream_handle.rs:47, 105). The inline comment is honest about it, but this has two real consequences worth reconsidering: (a) the progress thread registered on aXetUploadCommit/XetFileDownloadGroupcan't run if another Python thread is looping on an iterator from the same session (common in aThreadPoolExecutor-style consumer); (b) Ctrl-C won't be observed until the next chunk arrives, so a stalled download becomes uninterruptible from Python.XetDownloadStreamnot beingCloneis axet_pkgthing — could the handle be made clonable (or expose ablocking_nextthat takes&Arc<Self>) sopy.detach()is possible? This is the one part of the API where a large download hang will feel non-Pythonic. -
status()returns&'static str(py_xet_session.rs:77, and the same pattern on every handle/commit/group). This is a public API. Stringly-typed states are hard to match on from Python (if session.status() == "Running"with no autocomplete, no typos caught), and theErrorvariant's message is silently lifted into an exception — callers that want to inspect state without raising can't. I'd really like this to be either a PyO3 enum or a smallXetTaskStateclass with factory constants. This felt like a public-API issue so not marking as a nit. -
__exit__silently dropsabort()errors on the exception path (py_upload_commit.rs:250, py_file_download_group.rs:237). Ifabort()itself errors while an exception is already in flight, we swallow it withlet _ =. If abort-on-exception ever fails for a real reason (e.g. runtime already shut down), the original exception surfaces fine but we lose the observability. Atracing::warn!around the discarded error would at least leave a breadcrumb. -
XetSession::abort()vssigint_abort()aren't distinguished clearly (py_xet_session.rs:84, 89). The docstrings say "Cancel all active operations on this session. The session remains usable after abort" vs "SIGINT-style abort: shuts down the runtime and cancels all tasks." It's not obvious from Python when to reach for which — worth a sentence on each explicitly contrasting the other ("unlikesigint_abort, the session remains usable" / "destroys the runtime; session is unusable afterwards").
Questions
with_token_refresh_url(url, headers)bakes in the{accessToken, exp, casUrl}response shape. That's fine for the Hub, but hf_xet's Python surface is becoming a public-ish API (pipe forhuggingface_hub, but also surfaced viaimport hf_xet). Worth renaming to something Hub-flavoured (with_hf_token_refresh_url?) or documenting that the JSON schema is part of the contract?XetSession()has no constructor arguments. Is there a design intent that all tunables live on the commit/group builder, or is session-level config (cache dir, concurrency limits) just not plumbed yet?hash_filesis the only legacy function kept withoutDeprecationWarning. Is that becausehuggingface_hubstill needs it, or does it have a place in the new surface too?
Overall this is a good shape for a 1.x story — once the above are resolved I think this is ready to ship.
|
Thanks for exposing the XetSession! This is awesome and will likely unlock very nice use cases (I especially look forward the "upload bytes from stream" feature). Small nit regarding the interface, would it be possible to have a context manager for the upload stream? i.e. instead of (or in addition of) stream = commit.upload_stream(name="big.bin")
for chunk in produce_chunks():
stream.write(chunk)
stream.finish() # must be called before the with-block exitsbe able to do with commit.upload_stream(name="big.bin") as stream:
for chunk in produce_chunks():
stream.write(chunk)=> no need for explicit |
|
Made some other comments in huggingface/huggingface_hub#4116 :) |
Documented that the JSON schema is part of the contract.
Session level config plumbed through in commit 0676fd4
|
Added in #825 |
rajatarya
left a comment
There was a problem hiding this comment.
Thanks for the careful pass — looking through the changes since 5d3aa3c6, every actionable item from my prior review is addressed:
| Prior item | Resolution |
|---|---|
#1 upload_stream not releasing GIL |
Now wrapped in py.detach(...) (py_upload_commit.rs:298), consistent with upload_file / upload_bytes. |
| #2 Missing terminal-tick progress callback | Progress thread now checks is_terminal after firing the callback for the snapshot, so the final 100% tick lands (py_upload_commit.rs:117-130). |
#3 __next__ holds GIL on stream download |
Acknowledged and deferred to a follow-up as an optimization. The constraint (XetDownloadStream not being Clone in xet_pkg) is real, so this is reasonable to land separately. |
#4 status() returning &'static str |
Replaced with PyXetTaskState enum (lib.rs:46-53). PyO3's mixed-variant restriction is documented inline, and the Error variant is correctly surfaced as a raised exception via task_state_to_pystate. |
#5 __exit__ swallowing abort() errors |
tracing::warn! added on the exception path (py_upload_commit.rs:208-210). |
#6 abort() vs sigint_abort() confusion |
Resolved by removing XetSession::abort() from the Python surface entirely; sigint_abort docstring now contrasts with the per-handle abort (py_xet_session.rs:269-280). |
nit UniqueID/UniqueId mismatch |
Split into #824. |
A few additional things I particularly liked in this iteration:
PyXetConfigAPI — thewith_config(name, value)/with_config({...})overload, dict-like__getitem__/keys()/items()/ iteration, and the immutable-update-returning-new-instance pattern is exactly the right Python ergonomics. Nicely isolated inhf_xet/src/config.rssoXetConfig's Rust shape can evolve without churning the Python surface.- Sha256 sentinels (
COMPUTE_SHA256/SKIP_SHA256as zero-data#[pyclass(frozen)]types + module-level constants) is much more Pythonic than the prior policy enum, andparse_sha256acceptsNone | sentinel | strcleanly. - Rename to
XetSession.new_upload_commit/new_file_download_group/new_download_stream_groupreads better than the older builder-pattern naming. __repr__onPyXetSessionincluding config dump is useful for debugging — and the test suite'stest_session.py/test_config.pygive good shape coverage.- The legacy module shim with
DeprecationWarningkeepshuggingface_hub's pinned>=1.4.3,<2.0.0range valid, which was the right call coming out of the Slack discussion.
The doc-string :meth:/:class: cross-refs are consistent throughout, the kwargs-only signatures (#[pyo3(signature = (...))]) match the Python convention, and the GIL discipline (everywhere except the deferred download-stream __next__) is now uniform.
This is a big surface and a clean landing. Ship it after #824 and #825 stack in. 🚀
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit dc478ab. Configure here.
In response to #792 (comment), this removes the \`UniqueID\` type alias that re-exported \`xet_runtime::utils::UniqueId\` under a screaming-snake-case name from \`xet_data::progress_tracking\`. This type alias is unnecessary and caused confusion for reviewers (both human beings and agents).
Addresses the review [comment](#792 (comment)) on #792: adds a context manager to `XetStreamUpload` so callers no longer need an explicit `finish()` call.

Summary
Replaces the old
upload_files/download_files/hash_filesPython functions with a new object-oriented API that exposesXetSessionand its child objects directly as PyO3 classes. This gives Python callers full control over session lifecycle, connection pooling, and progress reporting.The previous module-level functions are kept under
hf_xet/src/legacy/and remain importable asfrom hf_xet import upload_filesetc., but now emitDeprecationWarning.New Python API
Files Changed
New files
hf_xet/src/py_xet_session.rs—XetSessionPyO3 classhf_xet/src/py_upload_commit.rs—XetUploadCommit, SHA-256 sentinels (COMPUTE_SHA256,SKIP_SHA256), report typeshf_xet/src/py_file_upload_handle.rs—XetFileUploadhf_xet/src/py_stream_upload_handle.rs—XetStreamUploadhf_xet/src/py_file_download_group.rs—XetFileDownloadGrouphf_xet/src/py_file_download_handle.rs—XetFileDownloadhf_xet/src/py_download_stream_group.rs—XetDownloadStreamGrouphf_xet/src/py_download_stream_handle.rs—XetDownloadStream,XetUnorderedDownloadStreamhf_xet/src/config.rs—XetConfigPython class (immutable,with_config(),get(),items(),keys(),__getitem__)hf_xet/src/headers.rs—build_headers_with_user_agenthelperhf_xet/src/legacy/mod.rs— re-exports all legacy symbolshf_xet/src/legacy/types.rs—PyXetDownloadInfo,PyXetUploadInfo,PyPointerFilehf_xet/src/legacy/functions.rs— deprecatedupload_bytes,upload_files,download_files,force_sigint_shutdown;hash_filesretained without deprecationhf_xet/src/legacy/progress_update.rs—PyItemProgressUpdate,PyTotalProgressUpdate,WrappedProgressUpdaterhf_xet/src/legacy/runtime.rs— async runtime + SIGINT handler (used by legacy functions)hf_xet/src/legacy/token_refresh.rs— Python callback token refresher (used by legacy functions)hf_xet/tests/conftest.py— shared fixtures and upload helpershf_xet/tests/test_upload_commit.py— upload tests (file, bytes, stream, SHA-256 policy, progress, abort)hf_xet/tests/test_file_download.py— file download tests (handles, round-trips, progress, cancel)hf_xet/tests/test_stream_download.py— ordered and unordered streaming download tests with range variantshf_xet/tests/test_progress.py— progress callback argument types and field verificationhf_xet/tests/test_session.py—XetSessionlifecycle and group/commit creation testshf_xet/tests/test_config.py—XetConfigconstruction,with_config,get,items,keysModified files
hf_xet/src/lib.rs— module declarations;XetTaskStatePython enum;blocking_call_with_signal_checkutility; legacy module registered at top level for backward compatibilityhf_xet/src/logging.rs— callsxet_pkg::init_logging()instead ofxet_runtimedirectlyhf_xet/Cargo.toml— addedxet-runtime,xet-clientdeps (for legacy module); feature flags route throughxet-pkgxet_pkg/src/xet_session/file_download_group.rs— exposesXetDownloadGroupReportas a Python class (pyclass(get_all),__repr__)xet_data/src/processing/xet_file.rs—#[new]Python constructor forXetFileInfoxet_pkg/Cargo.toml— addedno-default-cache,tokio-console,elevated_information_levelfeaturesxet_pkg/src/lib.rs— addedinit_logging()wrapperxet_runtime/src/core/runtime.rs— fork-safeDrop: detect child process via stored PID, discard runtime instead of blocking shutdown.github/workflows/ci.yml— added Python integration test step (maturin + pytest) to Linux, Windows, macOS jobsTest Plan
cargo test --verbose --no-fail-fastinhf_xet/maturin develop && pytest hf_xet/tests/ -vhuggingface_hubupload/download flows against staging: "Xet" test in Use the new XetSession API huggingface_hub#4116; manually tested progress update and Ctrl-C handlingDesign Notes
API style: groups and commits are created with keyword arguments directly on the factory method (
session.new_upload_commit(endpoint=..., token_refresh_url=..., progress_callback=...)) rather than a builder chain. The Rust builder is used internally and never surfaces in the Python API.Token refresh: the old API required Python to pass a token-refresh callable that Rust invoked across the GIL boundary. The new API uses
token_refresh_url+token_refresh_headers— Rust refreshes autonomously via HTTP, removing GIL re-entry on the hot path.WrappedTokenRefresheris kept only inlegacy/.SHA-256 policy:
start_upload_file,start_upload_bytes, andstart_upload_streamaccept asha256argument that is either a pre-computed hex string,COMPUTE_SHA256(default), orSKIP_SHA256. Sentinel objects have descriptive__repr__for debugging.XetConfig: immutable Python class wrapping
XetConfig.with_config(name, value)orwith_config({...})returns a new config with updates applied. Supports dotted-path access (config["data.max_concurrent_file_ingestion"],config.get("data.max_concurrent_file_ingestion")),items(),keys(), and iteration.XetTaskState: a Python class with variants
Running,Finalizing,Completed,UserCancelled. TheErrorvariant of the internal Rust enum surfaces as a raised Python exception rather than an enum value (PyO3 0.26 does not support mixed unit/complex enum variants in#[pyclass]).Progress callbacks:
progress_callbackspawns a background thread that delivers(GroupProgressReport, dict[UniqueID, ItemProgressReport])to the Python callable everyprogress_interval_msmilliseconds (default 100). The same signature covers both upload and download groups.GIL release and Ctrl-C: queue operations (
start_upload_file,start_upload_bytes,start_download_file) usepy.detach()and return quickly. Long-wait operations (wait_to_finish()) run the blocking call on a background thread while the calling thread releases the GIL for 100 ms windows and pollspy.check_signals()— Ctrl-C raisesKeyboardInterruptwithin one interval without starving other Python threads.XetError::KeyboardInterruptmaps toPyKeyboardInterrupt. The recommended caller pattern isexcept KeyboardInterrupt: session.sigint_abort(); raise, which is idiomatic Python:sigint_abort()flags the runtime so the background thread exits cleanly at its next checkpoint.Context managers and concurrency:
XetUploadCommitandXetFileDownloadGroupimplement__enter__/__exit__;__exit__delegates towait_to_finish()on success andabort()on exception. Multiplestart_upload_file/start_download_filecalls within awithblock run concurrently; the block exit waits for all to complete.XetDownloadStreamGroupis not a context manager — streams are opened withdownload_stream()/download_unordered_stream()and iterated directly.Streaming uploads:
commit.start_upload_stream()returns aXetStreamUploadhandle for incremental writes (.write(bytes), then.finish()beforewait_to_finish()).Streaming downloads:
download_streamanddownload_unordered_streamaccept optionalstart/endbyte offsets; either may be omitted independently.download_unordered_streamyields(offset, bytes)tuples in completion order.Fork-safe runtime drop:
XetRuntimerecords its creating PID; ifdropfires in a child process afterfork, the parent's Tokio threads don't exist soshutdown_timeout()would block. The runtime is discarded viamem::forgetinstead.Backward compatibility: all pre-1.x functions (
upload_bytes,upload_files,hash_files,download_files,force_sigint_shutdown) and types (PyXetDownloadInfo,PyXetUploadInfo,PyPointerFile,PyItemProgressUpdate,PyTotalProgressUpdate) remain importable from the top-levelhf_xetmodule. Deprecated functions emitDeprecationWarningatstacklevel=2.hash_filesis not deprecated.Note
Medium Risk
Medium risk because it reworks the Python-facing API surface and adds new concurrency/progress-callback behavior around uploads/downloads, which could introduce behavioral regressions despite extensive tests.
Overview
Adds a new object-oriented Python API around
XetSession(plus upload commits, file-download groups, and ordered/unordered download streams), including progress callbacks, SHA-256 policy sentinels, and signal-friendly blocking waits.Moves the pre-1.x module-level functions/types into a
legacymodule, keeps them importable for compatibility, and makes the main entry points emitDeprecationWarningwhile sharing header/User-Agent handling and centralized error conversion.Extends
xet_pkg/xet_datawith Python-facingpyclassreport/value types and minor API tweaks (e.g., optional stream progress, cloneable handles, active task info), bumpshf_xetto1.5.0, and updates CI to build the extension viamaturinand run new pytest integration tests on Linux/Windows/macOS.Reviewed by Cursor Bugbot for commit dc478ab. Bugbot is set up for automated code reviews on this repo. Configure here.