docs: add kserve integration guide #443
Conversation
Signed-off-by: Lasse4 <lasse@vierow.de>
@@ -0,0 +1,379 @@

> [KServe](https://kserve.github.io/website/) is a Kubernetes-native model inference platform. By fronting KServe with agentgateway, you can apply agent-aware policies, including token-based rate limiting, to your model serving endpoints without modifying your inference services.
We might want to rephrase "agent-aware policies" to be more specific about the policies that can be applied when routing to inference models:
Suggested change:

> [KServe](https://kserve.github.io/website/) is a Kubernetes-native platform for serving machine learning models. With agentgateway in front of KServe, you can enforce traffic management policies, such as token-based rate limiting, for inference requests without modifying your inference services.
> ## Before you begin
>
> {{< callout type="info" >}}
> Make sure you installed the Experimental Version.
I think we want "Latest Version" here. "Experimental" is a little confusing since it refers to the experimental APIs in Gateway API. Maybe rephrase to "version v1.2+"?
I am thinking we can remove this note because this will only show up in main, which is for the 1.2 version.
> ## Step 4: Deploy a mocked LLM with httpbun
>
> Instead of a real model, this guide uses [httpbun](https://httpbun.com/) to serve a mock OpenAI compatible endpoint. httpbun's `/llm/chat/completions` path returns a properly structured OpenAI chat completion response, including `usage.total_tokens` in the response body, which agentgateway reads to enforce token-based rate limits.
Do we want to try https://kserve.github.io/website/docs/getting-started/genai-first-llmisvc?_highlight=inferenceservice instead of httpbun here?
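If we keep httpbun, the mock backend could be a minimal custom-predictor `InferenceService` along these lines (a sketch only; the container image, container name, and port are assumptions, not taken from the guide):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mock-llm
  namespace: kserve-test
spec:
  predictor:
    containers:
    # Assumption: any httpbun image works here; this image reference is illustrative.
    - name: kserve-container
      image: sharat87/httpbun
      ports:
      # Assumption: the httpbun container listens on port 80.
      - containerPort: 80
```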
> ## Step 1: Install cert-manager
>
> 1. KServe requires cert-manager for webhook certificates.
Start steps with imperative verbs.
Suggested change:

> 1. Install cert-manager, which KServe requires for webhook certificates.
> Wait for the `InferenceService` to become ready.
Use step formatting for these subsequent checks.
Suggested change:

> 3. Wait for the `InferenceService` to become ready.
> ```shell
> kubectl get inferenceservices mock-llm -n kserve-test --watch
> ```
>
> Once `READY` is `True`, KServe creates an `HTTPRoute` that attaches to the agentgateway. Verify it. The route attaches to `kserve/kserve-ingress-gateway` with hostname `mock-llm-kserve-test.example.com`.
Suggested change:

> 4. Verify that KServe created an HTTPRoute after the Gateway becomes `READY`. The route attaches to `kserve/kserve-ingress-gateway` with hostname `mock-llm-kserve-test.example.com`.
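For context, the KServe-generated route is roughly shaped like this (a sketch; the gateway and hostname come from the guide, while the backend Service name and port are assumptions based on KServe's usual `<name>-predictor` convention):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-llm
  namespace: kserve-test
spec:
  parentRefs:
  - name: kserve-ingress-gateway
    namespace: kserve
  hostnames:
  - mock-llm-kserve-test.example.com
  rules:
  - backendRefs:
    # Assumption: KServe points the route at the predictor Service it creates.
    - name: mock-llm-predictor
      port: 80
```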
> ## Step 5: Create an AgentgatewayBackend
>
> KServe generates the `HTTPRoute` with a plain Kubernetes `Service` as the `backendRef`. Agentgateway only applies token-based rate limiting to traffic that flows through an `AgentgatewayBackend` with `spec.ai.provider` configured, because that is what signals to the proxy that the backend is an LLM and that response bodies contain a `usage.total_tokens` field to count against the rate limit bucket.
The second sentence is a bit complex, maybe split it up and rephrase for readability.
Also, combine with the note later on so you can remove it there, since they say similar things. In this rephrase, I also would remove the language about agentgateway supporting responses natively in AgentgatewayPolicy because we typically do not commit to or imply support for future releases in the docs.
Suggested change:

> KServe generates the `HTTPRoute` with a plain Kubernetes `Service` as the `backendRef`. However, to apply a token-based rate limiting policy, agentgateway needs the backend to be an AgentgatewayBackend. This way, agentgateway knows that the backend is an LLM that has a response body with the `usage.total_tokens` field to count against the rate limit bucket. In the following steps, you create an AgentgatewayBackend and a second HTTPRoute to route to it as a workaround to the KServe-created, Service-based setup.
> 1. Create an `AgentgatewayBackend` that points at the httpbun predictor service.
I am not sure what "predictor" means here; is it needed? Maybe just:
Suggested change:

> 1. Create an `AgentgatewayBackend` that points at the httpbun service.
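As a rough shape for that backend (a sketch only: the guide says `spec.ai.provider` must be set, but the apiVersion, resource name, provider settings, and service-reference fields below are assumptions, so check the AgentgatewayBackend reference for the real schema):

```yaml
# Sketch only: every field except spec.ai.provider is an assumption.
apiVersion: gateway.kgateway.dev/v1alpha1   # assumption
kind: AgentgatewayBackend
metadata:
  name: mock-llm-backend                    # illustrative name
  namespace: kserve-test
spec:
  ai:
    # The guide states that spec.ai.provider must be configured so agentgateway
    # treats the backend as an LLM and parses usage.total_tokens from responses.
    provider: {}                            # assumption: concrete provider settings go here
```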
> {{< callout type="info" >}}
> This extra `HTTPRoute` is a current workaround. Token-based rate limiting requires traffic to flow through an `AgentgatewayBackend` so the proxy knows to inspect the response body for `usage.total_tokens`. A future agentgateway release may support activating LLM-aware response parsing directly on an `AgentgatewayPolicy`, which would remove the need for this step.
> {{< /callout >}}
Maybe move this to the shortdesc and delete here, see previous comment.
Suggested change: remove the callout.
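For reference, the extra route that selects the `AgentgatewayBackend` would look roughly like this (a sketch; the backendRef group, kind, backend name, and hostname are assumptions based on the guide's description):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-llm-ai
  namespace: kserve-test
spec:
  parentRefs:
  - name: kserve-ingress-gateway
    namespace: kserve
  hostnames:
  # Assumption: the guide may match a different hostname or path for the AI route.
  - mock-llm-kserve-test.example.com
  rules:
  - backendRefs:
    # Assumption: the group and kind used to reference an AgentgatewayBackend;
    # check the agentgateway docs for the exact values.
    - group: gateway.kgateway.dev
      kind: AgentgatewayBackend
      name: mock-llm-backend
```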
> ## Step 6: Apply token-based rate limiting
>
> How token counting works: Agentgateway reads `usage.total_tokens` from the JSON response body returned by the inference service. Each request deducts that many tokens from the bucket. When the bucket empties, subsequent requests receive `429 Too Many Requests` until the next fill interval.
> 1. Apply an `AgentgatewayPolicy` that caps requests at **100 tokens per minute**. The policy targets the `mock-llm-ai` route that flows through the `AgentgatewayBackend`.
Suggested change:

> 1. Apply an `AgentgatewayPolicy` that caps requests at **100 tokens per minute**. The policy targets the `mock-llm-ai` route that selects the `AgentgatewayBackend`.
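The policy would be roughly this shape (a sketch only; apart from the 100 tokens-per-minute limit and the `mock-llm-ai` target taken from the guide, every field name below is an assumption, so consult the AgentgatewayPolicy reference for the real schema):

```yaml
# Sketch only: field names below are assumptions, not the confirmed schema.
apiVersion: gateway.kgateway.dev/v1alpha1   # assumption
kind: AgentgatewayPolicy
metadata:
  name: mock-llm-token-limit                # illustrative name
  namespace: kserve-test
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-llm-ai
  traffic:
    rateLimit:                              # assumption: exact rate-limiting fields may differ
      local:
        tokenBucket:
          maxTokens: 100
          tokensPerFill: 100
          fillInterval: 60s
```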
> ```shell
> kubectl get httproute mock-llm -n kserve-test -o yaml
> ```
I wonder if it's more confusing if we highlight that this route is created by KServe but then don't use it? I think there are a couple of options here:
- Remove the callout to the `kubectl get httproute mock-llm -n kserve-test -o yaml` section and only focus on the new HTTPRoute that references the AgentgatewayBackend.
- Add an example policy that you can attach to the HTTPRoute that's generated by KServe (maybe tracing or a transformation?).
For the transformation it could be something like this:
```yaml
traffic:
  transformation:
    response:
      set:
      - name: x-requested-model
        value: 'string(json(request.body).model)'
      - name: x-actual-model
        value: 'string(json(response.body).model)'
```
This PR is part of kgateway-dev/kgateway.dev#606