
feat: expose CAS client factory and optional chunk cache#675

Closed
XciD wants to merge 2 commits into main from feat/hf-mount-integration

Conversation

@XciD
Member

@XciD XciD commented Mar 1, 2026

Summary

  • data::create_remote_client() made public: Allows downstream consumers to create a CAS client and wrap it with custom caching layers before passing to FileDownloadSession::from_client().

  • Re-exports from data crate: CasClient, ChunkCache, CacheConfig, get_cache, and create_remote_client are publicly accessible, so downstream crates don't need to depend on cas_client/chunk_cache directly.

  • Optional ChunkCache in download path: FileDownloadSession, FileReconstructor, and XorbBlock accept an optional ChunkCache. When provided, xorb blocks are looked up in the on-disk cache before making HTTP requests, and stored after download. This enables cross-file deduplication for mount-style workloads where the same xorbs are accessed repeatedly.

  • Lazy permit acquisition: Download permits are acquired only on chunk cache miss, so cache hits don't consume network concurrency slots.

  • Pass file size to FileReconstructor: For full-file downloads, FileReconstructor now uses FileRange::new(0, file_size) instead of FileRange::full() (0..u64::MAX). This prevents ReconstructionTermManager from speculatively prefetching blocks beyond EOF, which caused the CAS server to return 416 Range Not Satisfiable for small files.

  • start_clean accepts Option<u64> for size: Changed the size parameter from u64 to Option<u64>. Passing None signals that the final size is unknown (streaming uploads), which prevents debug_assert panics when completed_bytes exceeds the initially declared total_bytes=0.

Note: this is a breaking change to FileDownloadSession::new, from_client, and start_clean signatures. The only known consumers are hf_xet (in this repo) and hf-mount.

Contributor

Copilot AI left a comment


Pull request overview

This PR extends the download/reconstruction pipeline to optionally use an on-disk ChunkCache for xorb data, and exposes client/cache-related APIs from the data crate to better support mount-style downstream consumers.

Changes:

  • Thread an optional ChunkCache through FileDownloadSession → FileReconstructor → FileTerm → XorbBlock, using it to get before network fetch and put after download.
  • Expose CAS client creation helpers publicly via data (remote_client_interface made public and create_remote_client re-exported), and re-export ChunkCache/CacheConfig/get_cache.
  • Add chunk_cache as a dependency where needed and update call sites/tests accordingly.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| file_reconstruction/src/reconstruction_terms/xorb_block.rs | Adds cache lookup + best-effort cache population around xorb block retrieval. |
| file_reconstruction/src/reconstruction_terms/file_term.rs | Passes optional cache down to xorb retrieval path. |
| file_reconstruction/src/file_reconstructor.rs | Stores optional cache and forwards it to term download tasks. |
| file_reconstruction/Cargo.toml | Adds chunk_cache dependency. |
| data/src/file_download_session.rs | Adds session-level optional cache and wires it into the reconstructor. |
| data/src/lib.rs | Re-exports CAS client/cache types and publicly re-exports client creation function. |
| data/src/remote_client_interface.rs | Makes create_remote_client public. |
| data/src/data_client.rs | Updates session construction for new signature. |
| data/src/bin/example.rs | Updates example to use new session constructor signature. |
| data/src/file_upload_session.rs | Updates tests that create a download session. |
| data/src/test_utils.rs | Minor doc/comment and call-site updates for new API. |
| data/Cargo.toml | Adds chunk_cache dependency. |
| Cargo.lock | Locks new dependency edges. |


Comment on lines 36 to 40
pub async fn new(
    config: Arc<TranslatorConfig>,
    progress_updater: Option<Arc<dyn TrackingProgressUpdater>>,
    chunk_cache: Option<Arc<dyn ChunkCache>>,
) -> Result<Arc<Self>> {

Copilot AI Mar 1, 2026


FileDownloadSession::new/from_client are public APIs and their signatures changed by adding a new chunk_cache parameter. This is a breaking change for downstream crates, which conflicts with the PR description’s “backwards-compatible” claim. If backwards compatibility is required, consider keeping the old signatures (delegating to new *_with_chunk_cache variants) or adding defaulted convenience constructors so existing callers don’t need to change.

Comment thread: data/src/lib.rs (outdated)
Comment on lines +16 to +22
pub use cas_client::Client as CasClient;
pub use chunk_cache::{CacheConfig, ChunkCache, get_cache};
pub use deduplication::RawXorbData;
pub use file_download_session::FileDownloadSession;
pub use file_reconstruction::DownloadStream;
pub use file_upload_session::FileUploadSession;
pub use remote_client_interface::create_remote_client;

Copilot AI Mar 1, 2026


The PR description references a new data::create_client() API, but this module publicly re-exports create_remote_client instead. If create_client is the intended stable entry point for downstream consumers, consider adding a wrapper with that name/signature (e.g., deriving session_id internally) and/or updating the exports/PR description so they match.

Comment thread: xet_data/src/file_reconstruction/reconstruction_terms/xorb_block.rs
Comment on lines +83 to +103
// Try the on-disk chunk cache before hitting the network.
if let Some(ref cache) = chunk_cache
    && let Ok(Some(cache_range)) = cache.get(&cache_key, &self.chunk_range).await
{
    // Report the transfer bytes as completed so progress tracking stays consistent.
    if let Some(ref updater) = progress_updater {
        let (_, _, http_range) = url_info.get_retrieval_url(self.xorb_block_index).await;
        let file_range = cas_types::FileRange::from(http_range);
        let transfer_bytes = file_range.end.saturating_sub(file_range.start);
        updater.report_transfer_progress(transfer_bytes);
    }
    let chunk_offsets: Vec<usize> = cache_range.offsets.iter().map(|&x| x as usize).collect();
    let data = Bytes::from(cache_range.data);
    let xorb_block_data = Arc::new(XorbBlockData {
        chunk_offsets,
        uncompressed_size: data.len() as u64,
        data,
    });
    *xbd_lg = Some(xorb_block_data.clone());
    return Ok(xorb_block_data);
}

Copilot AI Mar 1, 2026


The new chunk-cache fast path isn’t covered by tests. Since XorbBlock now supports (1) cache hit avoiding network and (2) best-effort cache population on miss, it would be good to add unit/integration coverage to verify get is consulted, put is called after download, and the returned chunk_offsets/data match expectations.

Comment on lines 61 to 66
@@ -64,7 +66,9 @@ impl FileTerm {
let xorb_block = self.xorb_block.clone();

Copilot AI Mar 1, 2026


get_data_task always acquires a download permit before calling retrieve_data. With the new chunk_cache path, this means cache hits can still block on (and consume) the network concurrency semaphore even though no HTTP request is made. Consider restructuring so the cache is checked before acquiring the permit (e.g., move permit acquisition into XorbBlock::retrieve_data only on cache miss, or add an early cache-check method).

@XciD
Member Author

XciD commented Mar 1, 2026

Addressed all review comments:

#1 Breaking change: Updated PR description to explicitly note the breaking signature change. Only known consumers are hf_xet (already updated), so wrapper methods would be unnecessary surface.

#2 create_client vs create_remote_client: Fixed — we removed the create_client wrapper and made create_remote_client public directly. PR description updated to match.

#3 Empty prefix: Fixed in cf336ae — now using "default" to match PREFIX_DEFAULT / the rest of the codebase.

#4 Tests: Acknowledged, will add in a follow-up.

#5 Permit on cache hit: Fixed in cf336ae — acquire_download_permit() moved from get_data_task into retrieve_data, after the chunk cache check. Cache hits no longer block on the network concurrency semaphore.

@XciD XciD changed the title Expose CAS client factory and optional chunk cache for hf-mount Expose CAS client factory and optional chunk cache Mar 1, 2026
@rajatarya
Collaborator

@XciD : Can you share a bit about the context for these changes? Maybe I haven't read them carefully enough - is the goal to enable the chunk_cache on the new adaptive concurrency download/upload paths?

@XciD
Member Author

XciD commented Mar 2, 2026

My comment has been removed; re-adding it:

I'm PoC-ing a FUSE/NFS mount with xet for HF buckets, and these are the changes I need to enable xorb-level caching in it.

@XciD XciD marked this pull request as ready for review March 2, 2026 18:46
@rajatarya
Collaborator

> My comment has been removed; re-adding it:
>
> I'm PoC-ing a FUSE/NFS mount with xet for HF buckets, and these are the changes I need to enable xorb-level caching in it.

Ah - that makes sense! @assafvayner didn't you have a branch with this somewhere? I remember there being a demo after the offsite last year. Might make sense to resurrect that branch and build from it instead.

@assafvayner
Contributor

this is my mount branch, but it's outdated as far as the new download interface goes. It does set up an NFS mount daemon, though: https://github.com/huggingface/xet-core/tree/assaf/mount

It uses a fork (https://github.com/huggingface/nfsserve) of the original xetdata/nfsserve library that provides scaffolding for setting up NFS v3.

@XciD XciD changed the title Expose CAS client factory and optional chunk cache feat: expose CAS client factory and optional chunk cache Mar 3, 2026
@XciD
Member Author

XciD commented Mar 3, 2026

Thanks for sharing — did you share it somewhere? I can't find anything on Slack.
I started this weekend as a side project with an "S3 FUSE mount"-like tool for HF buckets, and as it grew, I decided to implement both NFS and FUSE inside the same project: https://github.com/huggingface-internal/hf-mount

Would love your thoughts on the implementation. It was done with Claude and re-read carefully. I think I'll post on Slack later this week so we can start testing it on the science cluster, for example.

Rebase of the hf-mount integration work onto the reorganized codebase.

Changes:
- Expose CAS client factory (create_remote_client), ChunkCache, CacheConfig,
  get_cache, and SingleFileCleaner from the data processing module
- Thread optional chunk_cache parameter through FileDownloadSession,
  FileReconstructor, FileTerm, and XorbBlock
- Add download_stream_from_offset for streaming from arbitrary byte offset
- Add with_file_size/with_chunk_cache builders on FileReconstructor
- Integrate chunk cache in XorbBlock: check cache before HTTP download,
  populate cache after successful download (non-blocking via tokio::spawn)
- Use file_size to bound reconstruction range for full-file downloads,
  preventing speculative requests beyond EOF
@XciD XciD force-pushed the feat/hf-mount-integration branch from b099c05 to 84f102e on March 11, 2026 19:36
When the final file size is unknown (streaming/append-only uploads),
passing size=0 with is_final_size_known=true causes debug_assert
panics when completed_bytes exceeds total_bytes=0. Changing the
parameter to Option<u64> lets callers pass None, which sets
is_final_size_known=false and allows incremental size updates.

This change was present before the rebase but got lost during the
package restructure squash.
XciD added a commit to huggingface/hf-mount that referenced this pull request Mar 12, 2026
Upgrade from rev b099c05 (8 separate crates) to rev 0d7808b
(4 consolidated crates) following the xet-core package restructure:

- cas_client, cas_types -> xet-client
- mdb_shard, merklehash, xorb_object -> xet-core-structures
- data, file_reconstruction -> xet-data
- utils -> xet-runtime

Also absorbs API changes from huggingface/xet-core#675:
- FileDownloadSession::from_client() accepts optional ChunkCache
- start_clean() takes Sha256Policy instead of Option<Sha256>
- start_clean() size parameter is now Option<u64> for streaming uploads
- FileReconstructor gains with_file_size() builder method
XciD added a commit to huggingface/hf-mount that referenced this pull request Mar 12, 2026
## Summary

- Upgrade xet-core from `b099c05` (8 separate crates) to `0d7808b` (4
consolidated crates) following the package restructure (#693)
- Absorbs API changes from huggingface/xet-core#675 (chunk cache, file
size bounds, Sha256Policy)
- `cas_client` + `cas_types` -> `xet-client`
- `mdb_shard` + `merklehash` + `xorb_object` -> `xet-core-structures`
- `data` + `file_reconstruction` -> `xet-data`
- `utils` -> `xet-runtime`

Depends on: huggingface/xet-core#675
XciD added a commit that referenced this pull request Mar 17, 2026
…hashes-and-compose (#717)

Combined branch for hf-mount that needs both CAS client factory (PR 675)
and range upload/compose APIs (PR 717). Chunk cache integration from 675
is stubbed out (param accepted but unused) since the struct shapes diverged.
@XciD
Member Author

XciD commented Mar 18, 2026

Will split this one into multiple smaller PRs:

#730 #731 #732

@XciD XciD closed this Mar 18, 2026