Use the new XetSession API#4116
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4116 +/- ##
==========================================
+ Coverage 75.00% 77.26% +2.26%
==========================================
Files 145 170 +25
Lines 13978 19469 +5491
==========================================
+ Hits 10484 15043 +4559
- Misses 3494 4426 +932 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 19efd09. Configure here.
Wauplin
left a comment
There was a problem hiding this comment.
This looks very solid now! Extensively tested it locally (mainly the CLI) and works as smoothly as before 👍
|
|
||
| # hf-xet version used in both install_requires and extras["hf_xet"] | ||
| HF_XET_VERSION = "hf-xet>=1.4.3,<2.0.0" | ||
| HF_XET_VERSION = "hf-xet>=1.5.0.dev3,<2.0.0" |
There was a problem hiding this comment.
Let's not merge this PR until v1.5.0 is officially released.
| XetSessionHolder, | ||
| XetTokenType, | ||
| fetch_xet_connection_info_from_repo_info, | ||
| get_xet_session, |
There was a problem hiding this comment.
| XetSessionHolder, | |
| XetTokenType, | |
| fetch_xet_connection_info_from_repo_info, | |
| get_xet_session, | |
| XetTokenType, | |
| fetch_xet_connection_info_from_repo_info, |
for now let's not make them public. We can always add them later if a downstream library requires a direct access to the xet session but I doubt.
(you always import them with from .utils._xet import ... in other modules so it shouldn't break anything)
| if new_xet_connection is None: | ||
| raise XetRefreshTokenError("Failed to refresh xet token") | ||
| return new_xet_connection.access_token, new_xet_connection.expiration_unix_epoch | ||
| refresh_url = f"{self.endpoint}/api/buckets/{bucket_id}/xet-write-token" |
There was a problem hiding this comment.
is this change intentional? (i.e. semi-hardcoding the URL) If not let's replace it with something like
refresh_url = xet_connection_info_refresh_url(
token_type=XetTokenType.WRITE,
repo_id=bucket_id,
repo_type="bucket",
endpoint=self.endpoint,
)| repo_id (`str`): | ||
| A namespace (user or an organization) and a repo name separated by a `/`. | ||
| repo_type (`str`): | ||
| Type of the repo to upload to: `"model"`, `"dataset"` or `"space"`. |
There was a problem hiding this comment.
| Type of the repo to upload to: `"model"`, `"dataset"` or `"space"`. | |
| Type of the repo to upload to. |
(handles the new "kernels" and "buckets" as well)
There was a problem hiding this comment.
(small refacto seems to work well in my local examples but I have to admit I did not carefully reviewed it - looks fine)
| for op in all_bytes_ops: | ||
| commit.start_upload_bytes(op.path_or_fileobj, sha256=op.upload_info.sha256.hex()) | ||
| except KeyboardInterrupt: | ||
| _GLOBAL_XET_HOLDER.sigint_abort() |
There was a problem hiding this comment.
since we have get_xet_session, I'm not a fan of having to import the low-level _GLOBAL_XET_HOLDER outside of the _xet module. Could you define a new utility like abort_xet_session which would be an alias for _GLOBAL_XET_HOLDER.sigint_abort()? This way we kinda centralize the logic around _GLOBAL_XET_HOLDER
| for xet_info, dest in non_zero_download_items: | ||
| group.start_download_file(xet_info, dest) | ||
| except KeyboardInterrupt: | ||
| _GLOBAL_XET_HOLDER.sigint_abort() |
|
I've asked Claude code to review the PR and it commented a few things. Nothing critical IMO
Suggestion: Delete My opinion (Wauplin): fine to remove them and mention the breaking change in the PR description (it's a breaking change but I don't expect downstream users to have actually raised from it)
My opinion (Wauplin): quick and easy to test so let's do it (even if not introduced in this PR)
My opinion (Wauplin): let's keep it as-is except if you really care about it^^
My opinion (Wauplin): why not but low prio |

Summary
Migrates all xet upload/download code from the old function-based
hf_xetAPI (upload_files,download_files, globalMULTITHREADED_RUNTIME) to the new session-basedXetSessionAPI introduced inhf-xet >= 1.5.0.dev0.Upload:
Download:
Architecture:
utils/_xet.py—XetSessionHolderis a lightweight wrapper aroundXetSessionwith three safety guarantees:sigint_abort()cancels the active session and clears the reference so the nextget()creates a fresh one (notebook-friendly — subsequent cells still work after Ctrl-C)get()checksos.getpid()against the PID that created the session; on mismatch the inherited session is discarded and a fresh one is created for the child process (prevents deadlock when multiprocessing workers inherit a dead Tokio runtime)threading.Lockguards all state mutations, important for free-threaded Python (3.14t) where multiple threads can race onget()orsigint_abort()without the GIL serialising them_GLOBAL_XET_HOLDER = XetSessionHolder()is a module-level singleton (analogous toget_session()for HTTP). All upload/download call sites share the same session viaget_xet_session()— repo commits,hf_hub_download, bucket uploads/downloads, andsnapshot_downloadall reuse the same underlying Tokio runtimeexcept KeyboardInterrupt: _GLOBAL_XET_HOLDER.sigint_abort(); raiseto abort the in-flight Rust operation and allow clean re-use of xet in the same processxet_connection_info_refresh_url()centralises token-refresh URL construction: handles thebucketrepo-type (revision omitted) and the/Nonerevision for PR write-token requests (where the final revision is not yet known)_xet_progress_reporting.py— progress callback receives(group_report, item_reports)matching the new API contractsetup.py— minimumhf-xetversion bumped to>= 1.5.0.dev3hf_xet.XetSession; fork-safety unit and integration tests added totest_xet_utils.pyNote on
upload_large_folderand Ctrl-C: xet work there runs inside background worker threads. Python only deliversKeyboardInterruptto the main thread, andPyErr_CheckSignals()is a no-op in non-main threads — so Ctrl-C responsiveness forupload_large_folderrequires a separate fix (shared stop event + worker cooperation) and is out of scope for this PR.TODO
upload_large_folderto use theXetSessionAPI directly: register each file to the upload commit object immediately after SHA-256 computation (no batching needed — xet handles concurrency internally), then callcommit()once at the end. This removes the artificial batch size tuning (UPLOAD_BATCH_SIZE_XET) and makes the xet path inupload_large_folderconsistent with the rest of the codebase.