feat: expose CAS client factory and optional chunk cache#675
feat: expose CAS client factory and optional chunk cache#675
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the download/reconstruction pipeline to optionally use an on-disk ChunkCache for xorb data, and exposes client/cache-related APIs from the data crate to better support mount-style downstream consumers.
Changes:
- Thread an optional
ChunkCachethroughFileDownloadSession→FileReconstructor→FileTerm→XorbBlock, using it togetbefore network fetch andputafter download. - Expose CAS client creation helpers publicly via
data(remote_client_interfacemade public andcreate_remote_clientre-exported), and re-exportChunkCache/CacheConfig/get_cache. - Add
chunk_cacheas a dependency where needed and update call sites/tests accordingly.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| file_reconstruction/src/reconstruction_terms/xorb_block.rs | Adds cache lookup + best-effort cache population around xorb block retrieval. |
| file_reconstruction/src/reconstruction_terms/file_term.rs | Passes optional cache down to xorb retrieval path. |
| file_reconstruction/src/file_reconstructor.rs | Stores optional cache and forwards it to term download tasks. |
| file_reconstruction/Cargo.toml | Adds chunk_cache dependency. |
| data/src/file_download_session.rs | Adds session-level optional cache and wires it into the reconstructor. |
| data/src/lib.rs | Re-exports CAS client/cache types and publicly re-exports client creation function. |
| data/src/remote_client_interface.rs | Makes create_remote_client public. |
| data/src/data_client.rs | Updates session construction for new signature. |
| data/src/bin/example.rs | Updates example to use new session constructor signature. |
| data/src/file_upload_session.rs | Updates tests that create a download session. |
| data/src/test_utils.rs | Minor doc/comment and call-site updates for new API. |
| data/Cargo.toml | Adds chunk_cache dependency. |
| Cargo.lock | Locks new dependency edges. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub async fn new( | ||
| config: Arc<TranslatorConfig>, | ||
| progress_updater: Option<Arc<dyn TrackingProgressUpdater>>, | ||
| chunk_cache: Option<Arc<dyn ChunkCache>>, | ||
| ) -> Result<Arc<Self>> { |
There was a problem hiding this comment.
FileDownloadSession::new/from_client are public APIs and their signatures changed by adding a new chunk_cache parameter. This is a breaking change for downstream crates, which conflicts with the PR description’s “backwards-compatible” claim. If backwards compatibility is required, consider keeping the old signatures (delegating to new *_with_chunk_cache variants) or adding defaulted convenience constructors so existing callers don’t need to change.
| pub use cas_client::Client as CasClient; | ||
| pub use chunk_cache::{CacheConfig, ChunkCache, get_cache}; | ||
| pub use deduplication::RawXorbData; | ||
| pub use file_download_session::FileDownloadSession; | ||
| pub use file_reconstruction::DownloadStream; | ||
| pub use file_upload_session::FileUploadSession; | ||
| pub use remote_client_interface::create_remote_client; |
There was a problem hiding this comment.
The PR description references a new data::create_client() API, but this module publicly re-exports create_remote_client instead. If create_client is the intended stable entry point for downstream consumers, consider adding a wrapper with that name/signature (e.g., deriving session_id internally) and/or updating the exports/PR description so they match.
| // Try the on-disk chunk cache before hitting the network. | ||
| if let Some(ref cache) = chunk_cache | ||
| && let Ok(Some(cache_range)) = cache.get(&cache_key, &self.chunk_range).await | ||
| { | ||
| // Report the transfer bytes as completed so progress tracking stays consistent. | ||
| if let Some(ref updater) = progress_updater { | ||
| let (_, _, http_range) = url_info.get_retrieval_url(self.xorb_block_index).await; | ||
| let file_range = cas_types::FileRange::from(http_range); | ||
| let transfer_bytes = file_range.end.saturating_sub(file_range.start); | ||
| updater.report_transfer_progress(transfer_bytes); | ||
| } | ||
| let chunk_offsets: Vec<usize> = cache_range.offsets.iter().map(|&x| x as usize).collect(); | ||
| let data = Bytes::from(cache_range.data); | ||
| let xorb_block_data = Arc::new(XorbBlockData { | ||
| chunk_offsets, | ||
| uncompressed_size: data.len() as u64, | ||
| data, | ||
| }); | ||
| *xbd_lg = Some(xorb_block_data.clone()); | ||
| return Ok(xorb_block_data); | ||
| } |
There was a problem hiding this comment.
The new chunk-cache fast path isn’t covered by tests. Since XorbBlock now supports (1) cache hit avoiding network and (2) best-effort cache population on miss, it would be good to add unit/integration coverage to verify get is consulted, put is called after download, and the returned chunk_offsets/data match expectations.
| @@ -64,7 +66,9 @@ impl FileTerm { | |||
| let xorb_block = self.xorb_block.clone(); | |||
There was a problem hiding this comment.
get_data_task always acquires a download permit before calling retrieve_data. With the new chunk_cache path, this means cache hits can still block on (and consume) the network concurrency semaphore even though no HTTP request is made. Consider restructuring so the cache is checked before acquiring the permit (e.g., move permit acquisition into XorbBlock::retrieve_data only on cache miss, or add an early cache-check method).
|
Addressed all review comments: #1 Breaking change: Updated PR description to explicitly note the breaking signature change. Only known consumers are #2 #3 Empty prefix: Fixed in cf336ae — now using #4 Tests: Acknowledged, will add in a follow-up. #5 Permit on cache hit: Fixed in cf336ae — |
|
@XciD : Can you share a bit about the context for these changes? Maybe I haven't read them carefully enough - is the goal to enable the chunk_cache on the new adaptive concurrency download/upload paths? |
|
My comment have been removed, re-adding it: I'm pocing a Fuse/NFS mount with xet for hf buckets, and those are change I need to enable xorb cache level on it. |
Ah - that makes sense! @assafvayner didn't you have a branch with this somewhere? I remember there being a demo after the offsite last year. Might make sense to resurrect that branch and build from it instead. |
|
this is my mount branch, but it's a outdates as far as as the new download interface. It does set up an nfs mount daemon though. https://github.com/huggingface/xet-core/tree/assaf/mount It uses a fork (https://github.com/huggingface/nfsserve) of the original xetdata/nfsserve library that provides scaffolding for setting up nfs v3 |
|
Thanks for sharing, did you share it somewhere? I can't find anything on Slack. Would love your thoughts on the implementation. Was done with Claude, and re-read carefully. I think I will post on Slack later this week if we can start to test it over science cluster for example. |
c4ab47c to
ed9c095
Compare
d5b1478 to
e85cf9d
Compare
Rebase of the hf-mount integration work onto the reorganized codebase. Changes: - Expose CAS client factory (create_remote_client), ChunkCache, CacheConfig, get_cache, and SingleFileCleaner from the data processing module - Thread optional chunk_cache parameter through FileDownloadSession, FileReconstructor, FileTerm, and XorbBlock - Add download_stream_from_offset for streaming from arbitrary byte offset - Add with_file_size/with_chunk_cache builders on FileReconstructor - Integrate chunk cache in XorbBlock: check cache before HTTP download, populate cache after successful download (non-blocking via tokio::spawn) - Use file_size to bound reconstruction range for full-file downloads, preventing speculative requests beyond EOF
b099c05 to
84f102e
Compare
When the final file size is unknown (streaming/append-only uploads), passing size=0 with is_final_size_known=true causes debug_assert panics when completed_bytes exceeds total_bytes=0. Changing the parameter to Option<u64> lets callers pass None, which sets is_final_size_known=false and allows incremental size updates. This change was present before the rebase but got lost during the package restructure squash.
Upgrade from rev b099c05 (8 separate crates) to rev 0d7808b (4 consolidated crates) following the xet-core package restructure: - cas_client, cas_types -> xet-client - mdb_shard, merklehash, xorb_object -> xet-core-structures - data, file_reconstruction -> xet-data - utils -> xet-runtime Also absorbs API changes from huggingface/xet-core#675: - FileDownloadSession::from_client() accepts optional ChunkCache - start_clean() takes Sha256Policy instead of Option<Sha256> - start_clean() size parameter is now Option<u64> for streaming uploads - FileReconstructor gains with_file_size() builder method
## Summary - Upgrade xet-core from `b099c05` (8 separate crates) to `0d7808b` (4 consolidated crates) following the package restructure (#693) - Absorbs API changes from huggingface/xet-core#675 (chunk cache, file size bounds, Sha256Policy) - `cas_client` + `cas_types` -> `xet-client` - `mdb_shard` + `merklehash` + `xorb_object` -> `xet-core-structures` - `data` + `file_reconstruction` -> `xet-data` - `utils` -> `xet-runtime` Depends on: huggingface/xet-core#675
…hashes-and-compose (#717) Combined branch for hf-mount that needs both CAS client factory (PR 675) and range upload/compose APIs (PR 717). Chunk cache integration from 675 is stubbed out (param accepted but unused) since the struct shapes diverged.
Summary
data::create_remote_client()made public: Allows downstream consumers to create a CAS client and wrap it with custom caching layers before passing toFileDownloadSession::from_client().Re-exports from
datacrate:CasClient,ChunkCache,CacheConfig,get_cache, andcreate_remote_clientare publicly accessible, so downstream crates don't need to depend oncas_client/chunk_cachedirectly.Optional
ChunkCachein download path:FileDownloadSession,FileReconstructor, andXorbBlockaccept an optionalChunkCache. When provided, xorb blocks are looked up in the on-disk cache before making HTTP requests, and stored after download. This enables cross-file deduplication for mount-style workloads where the same xorbs are accessed repeatedly.Lazy permit acquisition: Download permits are acquired only on chunk cache miss, so cache hits don't consume network concurrency slots.
Pass file size to
FileReconstructor: For full-file downloads,FileReconstructornow usesFileRange::new(0, file_size)instead ofFileRange::full()(0..u64::MAX). This preventsReconstructionTermManagerfrom speculatively prefetching blocks beyond EOF, which caused the CAS server to return 416 Range Not Satisfiable for small files.start_cleanacceptsOption<u64>for size: Changed thesizeparameter fromu64toOption<u64>. PassingNonesignals that the final size is unknown (streaming uploads), which preventsdebug_assertpanics whencompleted_bytesexceeds the initially declaredtotal_bytes=0.Note: this is a breaking change to
FileDownloadSession::new,from_client, andstart_cleansignatures. The only known consumers arehf_xet(in this repo) andhf-mount.