📺 Live Demo — Browser-based illustration of `estimateQoS()` with interactive micro-benchmarks
- Authors
- Participate
- Introduction
- User-Facing Problem
- User research
- Use Cases
- Proposed Approach: Model-Centric Evaluation (Callee Responsible)
- Implementation Considerations (AI Stack Internals)
- Discussion: Potential API Enhancements
- Alternatives considered
- Accessibility, Internationalization, Privacy, and Security Considerations
- Stakeholder Feedback / Opposition
- References & acknowledgements
- Jonathan Ding (Intel)
This proposal addresses the challenge of efficiently offloading AI inference tasks from cloud servers to client devices while maintaining Quality of Service (QoS). This protocol provides a more effective mechanism for applications to evaluate whether a specific AI inference request is suitable for execution on the client side. It moves beyond static hardware specifications by enabling dynamic, privacy-preserving assessment of device capabilities, helping applications make informed offloading decisions. Throughout this document, the Application (App) represents the decision-making logic, which may reside on the client device (e.g., in a web browser) or on a cloud server.
Modern web applications increasingly rely on AI, but running these models solely in the cloud can be expensive and introduce latency. Conversely, running them on client devices is difficult because developers cannot easily determine if a target device—given its specific CPU, GPU, and NPU capabilities—can host a specific AI model without compromising performance or user privacy.
- Standardize a mechanism for identifying device performance buckets for AI tasks.
- Enable efficient offloading of AI inference from cloud to client devices.
- Maintain high Quality of Service (QoS) during offloading.
- Protect user privacy by avoiding detailed hardware fingerprinting.
- Provide a future-proof abstraction that works across varying hardware (CPU, GPU, NPU).
- Define a protocol that works regardless of whether the decision logic resides in the App's cloud backend or client frontend.
- Defining the specific wire protocol for model transmission (this proposal focuses on the negotiation/estimation phase).
- Mandatory implementation of any specific inference engine.
- Solving all AI workload types in version 1 (e.g., extremely large LLMs with dynamic shapes).
[Placeholder for user research findings. Initial feedback from ISVs and web developers indicates a strong need for predictable client-side AI performance metrics.]
A video conferencing application wants to offload background blur processing to the user's laptop to save server costs and improve privacy, but only if the device can maintain a stable 30fps.
- Inquiry: The application builds a weightless graph of its blur model and calls `context.estimateQoS()`.
- Estimation: The device evaluates its capability by integrating a wide range of local intelligence: the AI stack software (including specialized drivers and runtimes), the specific hardware accelerators, current system state (thermal state, battery level, power mode), and environmental configurations that might affect performance.
- Decision:
  - If the `performanceTier` meets the application's requirements (e.g., "excellent", "good", or "fair" for real-time video), the application logic decides to download the full weights, bind them, and run locally.
  - Otherwise (e.g., "slow", "very-slow", "poor"), it falls back to cloud-based processing.
A photo editing web app wants to run complex enhancement filters using the user's mobile NPU to reduce latency and maintain privacy.
- Inquiry: The application provides a weightless description of the enhancement model to `context.estimateQoS()`, including specific target resolutions.
- Estimation: The device evaluates its capability by considering the current hardware and software environment, including AI stack optimizations, hardware accelerators (such as the NPU), and overall system state (e.g., battery level, power mode, thermal conditions).
- Decision: The application enables the "High Quality" filter locally if the performance tier meets the requirements.
The preferred approach is Model-Centric, where the device (the callee, i.e., the responder to the AI request) is responsible for evaluating its own ability to handle the requested AI workload. In this model, the Application (the caller) sends a Model Description Inquiry—a weightless description of the AI model and input characteristics—to the device. The device, as the callee, uses its local knowledge of hardware, current system state, software environment, and implementation details to estimate the expected Quality of Service (QoS) for the given task.
```mermaid
sequenceDiagram
    participant App as App
    participant Device as Device
    participant Cloud as Cloud LLM
    App->>Device: Weightless Model Description & Input Metadata
    Note over Device: UA/AI Stack runs Local Estimation<br/>(Internal: Static / Dry Run / Black Box)
    Device-->>App: Return QoS Bucket (Performance Tier)
    Note over App: App makes Decision<br/>(Compare QoS vs Requirement)
    alt App Decides: Offload
        App->>Device: Bind Weights & Run Locally
    else App Decides: Cloud
        App->>Cloud: Execute on Cloud
    end
```
This "callee responsible" design ensures that sensitive device details remain private, as only broad performance tiers are reported back to the application. It also allows the device to make the most accurate estimation possible, considering real-time factors like thermal state, background load, and hardware-specific optimizations that are not visible to the caller (whether the caller logic is in the cloud or on the client). By shifting responsibility for QoS evaluation to the callee, the protocol achieves both privacy protection and more reliable offloading decisions.
To enable consistent cross-vendor estimation, the protocol requires standardization of the following inputs and outputs:
- Weightless Model Description:
  - Based on the WebNN Graph topology.
  - Includes Lazy Bind Constants: Placeholders for weights (via descriptors and labels) that enable "weightless" graph construction and estimation without downloading large parameter files.
  - Dynamic vs. Static Graph Expression: This proposal currently uses the dynamic WebNN `MLGraphBuilder` API to construct the weightless graph at runtime. An alternative approach is to express the graph topology statically in a declarative format. The webnn-graph project defines a WebNN-oriented graph DSL (the `.webnn` format) that separates the graph definition (structure only, no tensor data) from a weights manifest and a binary weights file. This static representation is human-readable, diffable, and enables tooling such as ONNX-to-WebNN conversion and graph visualization. A future version of DAOP could accept either a dynamically built `MLGraph` or a statically defined `.webnn` graph description as input to `estimateQoS()`.
- Model Metadata (Optional):
  - Information about the weights that can significantly impact performance, such as sparsity or specific quantization schemes.
- Input Characterization:
  - The shape and size of the input data (e.g., image resolution, sequence length).
- QoS Output:
  - Unified Performance Tiers (e.g., "excellent", "good", "fair", "moderate", "slow", "very-slow", "poor") to ensure hardware abstraction and prevent privacy leaks through precise latency metrics.
We propose a core API for performance negotiation:
```webidl
dictionary MLQoSReport {
  MLPerformanceTier performanceTier;
};

partial interface MLContext {
  Promise<MLQoSReport> estimateQoS(MLGraph graph, optional MLQoSOptions options);
};

dictionary MLQoSOptions {
  // Input characteristics
  record<DOMString, MLOperandDescriptor> inputDescriptors;
  // Weights characteristics (Optional)
  boolean weightsSparsity = false;
};
```

To maximize the benefits of DAOP, the underlying WebNN specification should support a weightless build mode. Currently, WebNN's `constant()` API typically requires an `ArrayBufferView`, which implies the weights must already be present in memory.
We propose that WebNN builders be extended to allow:
- Weightless Constants: Defining constants using only their descriptor (shape, type) and a `label` for late-binding.
- Lazy / Explicit Binding: Separating the graph topology definition from the binding of heavy weight data.

By using an explicit `bindConstants()` (or similar) method, we achieve lazy binding where weights are only provided and processed after the offloading decision is made. This design aligns with the proposal in webnn#901, which addresses the same fundamental problem from a memory-efficiency perspective. That proposal allows `builder.constant()` to accept just an `MLOperandDescriptor` (shape and type, no `ArrayBufferView`), producing a "hollow constant" handle. After `builder.build()`, weights are streamed to device memory one at a time via `graph.setConstantData(constantOperand, dataBuffer)`, reducing peak CPU memory from ~3× model size to ~1×. Our `bindConstants()` API could be integrated with or replaced by this `setConstantData()` mechanism in a future version of the spec, combining the benefits of weightless QoS estimation with memory-efficient weight loading.
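To make the lazy-binding flow concrete, the sketch below simulates it with a stub graph object. `StubGraph` and its behavior are illustrative stand-ins following the shape of webnn#901's `setConstantData()`; they are not a real WebNN API.

```javascript
// Stub standing in for a compiled graph with "hollow" constants, following the
// shape of webnn#901's setConstantData(). Illustrative only — not a real API.
class StubGraph {
  constructor(constantLabels) {
    this.pending = new Set(constantLabels); // constants still awaiting weight data
  }
  setConstantData(label, buffer) {
    if (!this.pending.has(label)) {
      throw new Error(`unknown or already-bound constant: ${label}`);
    }
    this.pending.delete(label); // real impl: copy buffer to device memory here
  }
  get ready() {
    return this.pending.size === 0;
  }
}

// Binds weights one at a time, so each buffer can be released before the next
// is loaded — peak CPU memory stays near one tensor instead of the whole model.
function bindAllConstants(graph, getWeightBuffer, labels) {
  for (const label of labels) {
    const buffer = getWeightBuffer(label); // in practice, a ranged fetch() per tensor
    graph.setConstantData(label, buffer);
  }
}
```

Streaming one tensor at a time is what brings peak CPU memory down from roughly 3× to roughly 1× the model size in the webnn#901 analysis.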
The estimateQoS() API returns a performanceTier string that represents the device's estimated
ability to execute the given graph. The tiers are designed to be broad enough to prevent hardware
fingerprinting while still enabling meaningful offloading decisions:
| Tier | Indicative Latency | Interpretation |
|---|---|---|
| "excellent" | < 16 ms | Real-time (60 fps frame budget) |
| "good" | < 100 ms | Interactive responsiveness |
| "fair" | < 1 s | Responsive for non-real-time tasks |
| "moderate" | < 10 s | Tolerable for batch or one-shot tasks |
| "slow" | < 30 s | Noticeable wait |
| "very-slow" | < 60 s | Long wait |
| "poor" | ≥ 60 s | Likely unacceptable for most use cases |
The exact tier boundaries are implementation-defined and may be adjusted. The key requirement is that tiers remain coarse enough to avoid fingerprinting while fine enough for applications to make useful offloading decisions.
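For illustration only, a user agent could normalize its raw internal latency estimate into a tier using the indicative boundaries above. This is a sketch, not normative: the function name and exact thresholds are assumptions, and real implementations define their own boundaries.

```javascript
// Maps a raw internal latency estimate (ms) to a coarse performance tier,
// using the indicative boundaries from the table above. Illustrative only:
// the proposal leaves exact boundaries implementation-defined.
function latencyToTier(estimatedMs) {
  if (estimatedMs < 16) return "excellent"; // 60 fps frame budget
  if (estimatedMs < 100) return "good"; // interactive responsiveness
  if (estimatedMs < 1_000) return "fair";
  if (estimatedMs < 10_000) return "moderate";
  if (estimatedMs < 30_000) return "slow";
  if (estimatedMs < 60_000) return "very-slow";
  return "poor";
}
```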
Applications choose their own acceptance threshold based on use-case requirements. For example, a video conferencing blur might require "good" or better, while a one-shot photo enhancement might accept "moderate".
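Since the tiers form a total order, an application can express its threshold as a rank comparison rather than enumerating acceptable tiers. A minimal sketch (`tierAtLeast` is a hypothetical helper, not part of the proposed API; the ordering mirrors the table above):

```javascript
// Tiers ordered from best to worst, mirroring the tier table above.
const TIER_ORDER = ["excellent", "good", "fair", "moderate", "slow", "very-slow", "poor"];

// True when `tier` is as good as or better than `required`.
// Unknown tier strings are treated conservatively as unacceptable.
function tierAtLeast(tier, required) {
  const got = TIER_ORDER.indexOf(tier);
  const need = TIER_ORDER.indexOf(required);
  return got !== -1 && need !== -1 && got <= need;
}
```

A video conferencing app might then gate offloading on `tierAtLeast(qos.performanceTier, "good")`, while a one-shot enhancement could use `"moderate"` as its threshold.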
The underlying system (e.g., User Agent or WebNN implementation) can use several strategies to estimate performance for the weightless graph. These strategies are internal implementation details of the AI stack and are transparent to the application developer. It is important to note that these strategies are not part of the DAOP specification or proposal; they are discussed here only to illustrate possible implementation choices and feasibility. Common techniques include:
- Static Cost Model: Analytical formulas (e.g., Roofline model) or lookup tables to predict operator costs based on descriptors.
- Dry Run: Fast simulation of the inference engine's execution path without heavy computation or weights.
- Black Box Profiling: Running the actual model topology using dummy/zero weights to measure timing.
For a concrete demonstration of these techniques, see the daop-illustration project and its implementation details. It showcases a Static Cost Model strategy that employs log-log polynomial interpolation of measured operator latencies derived from per-operator micro-benchmarks. By fitting degree-1 polynomials (power-law curves) to latency data across multiple tensor sizes in logarithmic space, with a left-side clamp to handle small-size noise, the implementation captures the performance characteristics common in GPU-accelerated workloads. This illustration uses a simplified approach for demonstration purposes; production implementations could employ other strategies such as Roofline models, learned cost models, hardware-specific operator libraries, or ML-based performance predictors. The resulting metrics (regression coefficients, estimated throughput) are internal to the AI stack and are never exposed directly to the web application.
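To make the log-log fitting idea concrete, here is a minimal sketch, not taken from the daop-illustration code: it fits a degree-1 polynomial over log(size) by least squares (equivalently, a power law latency ≈ c · size^k) and predicts latency at an unmeasured size. The sample data is made up, and the left-side clamp for small-size noise is omitted for brevity.

```javascript
// Fit latency ≈ c * size^k by ordinary least squares in log-log space
// (a degree-1 polynomial over log(size)), returning a prediction function.
function fitPowerLaw(sizes, latenciesMs) {
  const xs = sizes.map(Math.log);
  const ys = latenciesMs.map(Math.log);
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  const k = sxy / sxx; // power-law exponent
  const logC = meanY - k * meanX; // log of the scale factor
  return (size) => Math.exp(logC + k * Math.log(size));
}

// Made-up micro-benchmark samples where latency scales linearly with size.
const predict = fitPowerLaw([1e3, 1e4, 1e5], [1, 10, 100]);
// predict(1e6) ≈ 1000 ms
```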
To prevent hardware fingerprinting, the raw estimation results are normalized into broad Performance Tiers before being returned to the web application. The application logic remains decoupled from the hardware-specific details.
The following example shows how an application might use the API to decide whether to offload.
```js
// 1. Initialize WebNN context
const context = await navigator.ml.createContext({ deviceType: "npu" });
const builder = new MLGraphBuilder(context);

// 2. Build a WEIGHTLESS graph
const weights = builder.constant({
  shape: [3, 3, 64, 64],
  dataType: "float32",
  label: "modelWeights", // Identity for late-binding metadata
});
const input = builder.input("input", { shape: [1, 3, 224, 224], dataType: "float32" });
const output = builder.conv2d(input, weights);
const graph = builder.build();

// 3. DAOP Estimation: Providing input characteristics
const qos = await context.estimateQoS(graph, {
  inputDescriptors: {
    input: { shape: [1, 3, 720, 1280], dataType: "float32" },
  },
});

// Check if the performance tier meets our requirements
const acceptable = ["excellent", "good", "fair", "moderate"];
if (acceptable.includes(qos.performanceTier)) {
  const realWeights = await fetch("model-weights.bin").then((r) => r.arrayBuffer());

  // 4. Bind real data (using the label) explicitly.
  await context.bindConstants(graph, {
    modelWeights: realWeights,
  });

  // 5. Subsequent compute calls only need dynamic inputs
  const results = await context.compute(graph, {
    input: cameraFrame,
  });
} else {
  runCloudInference();
}
```

We are considering several additions to the API to better support adaptive applications:
Instead of returning a bucket, the application could provide its specific requirements (e.g., a minimum FPS or maximum latency) and receive a simple boolean "can meet requirement" response.
```webidl
partial interface MLContext {
  Promise<boolean> meetsRequirement(MLGraph graph, MLPerformanceTier requiredTier, optional MLQoSOptions options);
};
```

AI performance can change dynamically due to thermal throttling, battery state, or background system load. An event-driven mechanism would allow applications to react when the device's ability to meet a specific QoS requirement changes.
```webidl
interface MLQoSChangeEvent : Event {
  readonly attribute boolean meetsRequirement;
};
```

```js
// Application listens for changes in offload capability
const monitor = context.createQoSMonitor(graph, "excellent");
monitor.onchange = (e) => {
  if (!e.meetsRequirement) {
    console.log("Performance dropped, switching back to cloud.");
    switchToCloud();
  } else {
    console.log("Performance restored, offloading to local.");
    switchToLocal();
  }
};
```

In this alternative, the Application acts as the central intelligence. It collects raw hardware specifications and telemetry from the device and makes the offloading decision.
```mermaid
sequenceDiagram
    participant App as App
    participant Device as Device
    participant Cloud as Cloud LLM
    App->>Device: Request Device Description
    Device-->>App: Return Spec (CPU, GPU, NPU, Mem, Microbenchmarks...)
    Note over App: App estimates QoS<br/>(Mapping H/W Spec -> AI Performance)
    Note over App: App makes Decision<br/>(Compare QoS vs Requirement)
    alt App Decides: Offload
        App->>Device: Execute locally
    else App Decides: Cloud
        App->>Cloud: Execute on Cloud
    end
```
- Process: Device returns specific hardware details (CPU model, GPU frequency, NPU TOPs, micro-benchmark results) -> Application estimates QoS -> Application decides to offload.
- Why rejected:
- Privacy Risks: Exposes detailed hardware fingerprints and potentially sensitive system telemetry to remote servers.
- Estimation Complexity: It is extremely difficult for a remote server to accurately map raw hardware specs to actual inference performance across a fragmented device ecosystem, since it has no visibility into local drivers, thermal state, or OS-level optimizations.
- Scalability: Requires maintaining and constantly updating an impractical global database mapping every possible device configuration to AI performance profiles.
The Model-Centric approach significantly enhances privacy by:
- Avoiding hardware fingerprinting.
- Returning broad Performance Tiers rather than exact hardware identifiers or precise latency metrics.
- Enabling local processing of sensitive user data (like photos or video) that would otherwise need to be sent to the cloud.
- Weightless model descriptions should be validated to prevent malicious topologies from causing resource exhaustion (DoS) during the estimation phase.
- [Implementors/ISVs]: Initial interest from several ISVs, to be documented.
Many thanks to the contributors to WebNN and the Web Machine Learning Working Group for their valuable feedback and advice.