docs: add kserve integration guide #443
Conversation
Signed-off-by: Lasse4 <lasse@vierow.de>
@@ -0,0 +1,379 @@

> [KServe](https://kserve.github.io/website/) is a Kubernetes-native model inference platform. By fronting KServe with agentgateway, you can apply agent-aware policies, including token-based rate limiting, to your model serving endpoints without modifying your inference services.
We might want to rephrase "agent-aware policies" to be more specific about the policies that can be applied when routing to inference models:
Suggested change:

> [KServe](https://kserve.github.io/website/) is a Kubernetes-native platform for serving machine learning models. With agentgateway in front of KServe, you can enforce traffic management policies, such as token-based rate limiting, for inference requests without modifying your inference services.
> ## Before you begin
>
> {{< callout type="info" >}}
> Make sure you installed the Experimental Version.
I think we want "Latest Version" here. "Experimental" is a little confusing since it refers to the experimental APIs in Gateway API. Maybe rephrase to "version v1.2+"?
I am thinking we can remove this note because this will only show up in main, which is for the 1.2 version.
> ## Step 4: Deploy a mocked LLM with httpbun
>
> Instead of a real model, this guide uses [httpbun](https://httpbun.com/) to serve a mock OpenAI compatible endpoint. httpbun's `/llm/chat/completions` path returns a properly structured OpenAI chat completion response, including `usage.total_tokens` in the response body, which agentgateway reads to enforce token-based rate limits.
Do we want to try https://kserve.github.io/website/docs/getting-started/genai-first-llmisvc?_highlight=inferenceservice instead of httpbun here?
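If we keep httpbun, the mock backend could be a minimal custom-predictor `InferenceService` along these lines (a sketch only; the container image, container name, and port are assumptions, not taken from the guide):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mock-llm
  namespace: kserve-test
spec:
  predictor:
    containers:
    # Assumption: any httpbun image works here; this image reference is illustrative.
    - name: kserve-container
      image: sharat87/httpbun
      ports:
      # Assumption: the httpbun container listens on port 80.
      - containerPort: 80
```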
> ## Step 1: Install cert-manager
>
> 1. KServe requires cert-manager for webhook certificates.
Start steps with imperative verbs.
Suggested change:

> 1. Install cert-manager, which KServe requires for webhook certificates.
> Wait for the `InferenceService` to become ready.
Use step formatting for these subsequent checks.
Suggested change:

> 3. Wait for the `InferenceService` to become ready.
> ```shell
> kubectl get inferenceservices mock-llm -n kserve-test --watch
> ```
>
> Once `READY` is `True`, KServe creates an `HTTPRoute` that attaches to the agentgateway. Verify it. The route attaches to `kserve/kserve-ingress-gateway` with hostname `mock-llm-kserve-test.example.com`.
Suggested change:

> 4. Verify that KServe created an HTTPRoute after the Gateway becomes `READY`. The route attaches to `kserve/kserve-ingress-gateway` with hostname `mock-llm-kserve-test.example.com`.
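For context, the KServe-generated route is roughly shaped like this (a sketch; the gateway and hostname come from the guide, while the backend Service name and port are assumptions based on KServe's usual `<name>-predictor` convention):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-llm
  namespace: kserve-test
spec:
  parentRefs:
  - name: kserve-ingress-gateway
    namespace: kserve
  hostnames:
  - mock-llm-kserve-test.example.com
  rules:
  - backendRefs:
    # Assumption: KServe points the route at the predictor Service it creates.
    - name: mock-llm-predictor
      port: 80
```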
> ## Step 5: Create an AgentgatewayBackend
>
> KServe generates the `HTTPRoute` with a plain Kubernetes `Service` as the `backendRef`. Agentgateway only applies token-based rate limiting to traffic that flows through an `AgentgatewayBackend` with `spec.ai.provider` configured, because that is what signals to the proxy that the backend is an LLM and that response bodies contain a `usage.total_tokens` field to count against the rate limit bucket.
The second sentence is a bit complex, maybe split it up and rephrase for readability.
Also, combine with the note later on so you can remove it there, since they say similar things. In this rephrase, I also would remove the language about agentgateway supporting responses natively in AgentgatewayPolicy because we typically do not commit to or imply support for future releases in the docs.
Suggested change:

> KServe generates the `HTTPRoute` with a plain Kubernetes `Service` as the `backendRef`. However, to apply a token-based rate limiting policy, agentgateway needs the backend to be an AgentgatewayBackend. This way, agentgateway knows that the backend is an LLM that has a response body with the `usage.total_tokens` field to count against the rate limit bucket. In the following steps, you create an AgentgatewayBackend and a second HTTPRoute to route to it as a workaround to the KServe-created, Service-based setup.
> 1. Create an `AgentgatewayBackend` that points at the httpbun predictor service.
I am not sure what "predictor" means here; is it needed? Maybe just:
Suggested change:

> 1. Create an `AgentgatewayBackend` that points at the httpbun service.
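As a rough shape for that backend (a sketch only: the guide says `spec.ai.provider` must be set, but the apiVersion, resource name, provider settings, and service-reference fields below are assumptions, so check the AgentgatewayBackend reference for the real schema):

```yaml
# Sketch only: every field except spec.ai.provider is an assumption.
apiVersion: gateway.kgateway.dev/v1alpha1   # assumption
kind: AgentgatewayBackend
metadata:
  name: mock-llm-backend                    # illustrative name
  namespace: kserve-test
spec:
  ai:
    # The guide states that spec.ai.provider must be configured so agentgateway
    # treats the backend as an LLM and parses usage.total_tokens from responses.
    provider: {}                            # assumption: concrete provider settings go here
```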
> {{< callout type="info" >}}
> This extra `HTTPRoute` is a current workaround. Token-based rate limiting requires traffic to flow through an `AgentgatewayBackend` so the proxy knows to inspect the response body for `usage.total_tokens`. A future agentgateway release may support activating LLM-aware response parsing directly on an `AgentgatewayPolicy`, which would remove the need for this step.
> {{< /callout >}}
Maybe move this to the shortdesc and delete here, see previous comment.
Suggested change: remove the callout.
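For reference, the extra route that selects the `AgentgatewayBackend` would look roughly like this (a sketch; the backendRef group, kind, backend name, and hostname are assumptions based on the guide's description):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-llm-ai
  namespace: kserve-test
spec:
  parentRefs:
  - name: kserve-ingress-gateway
    namespace: kserve
  hostnames:
  # Assumption: the guide may match a different hostname or path for the AI route.
  - mock-llm-kserve-test.example.com
  rules:
  - backendRefs:
    # Assumption: the group and kind used to reference an AgentgatewayBackend;
    # check the agentgateway docs for the exact values.
    - group: gateway.kgateway.dev
      kind: AgentgatewayBackend
      name: mock-llm-backend
```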
> ## Step 6: Apply token-based rate limiting
>
> How token counting works: Agentgateway reads `usage.total_tokens` from the JSON response body returned by the inference service. Each request deducts that many tokens from the bucket. When the bucket empties, subsequent requests receive `429 Too Many Requests` until the next fill interval.
> 1. Apply an `AgentgatewayPolicy` that caps requests at **100 tokens per minute**. The policy targets the `mock-llm-ai` route that flows through the `AgentgatewayBackend`.
Suggested change:

> 1. Apply an `AgentgatewayPolicy` that caps requests at **100 tokens per minute**. The policy targets the `mock-llm-ai` route that selects the `AgentgatewayBackend`.
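The policy would be roughly this shape (a sketch only; apart from the 100 tokens-per-minute limit and the `mock-llm-ai` target taken from the guide, every field name below is an assumption, so consult the AgentgatewayPolicy reference for the real schema):

```yaml
# Sketch only: field names below are assumptions, not the confirmed schema.
apiVersion: gateway.kgateway.dev/v1alpha1   # assumption
kind: AgentgatewayPolicy
metadata:
  name: mock-llm-token-limit                # illustrative name
  namespace: kserve-test
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-llm-ai
  traffic:
    rateLimit:                              # assumption: exact rate-limiting fields may differ
      local:
        tokenBucket:
          maxTokens: 100
          tokensPerFill: 100
          fillInterval: 60s
```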
> ```shell
> kubectl get httproute mock-llm -n kserve-test -o yaml
> ```
I wonder if it's more confusing if we highlight that this route is created by KServe but then don't use it? I think there are a couple of options here:
- Remove the callout to the `kubectl get httproute mock-llm -n kserve-test -o yaml` section and only focus on the new HTTPRoute that references the AgentgatewayBackend.
- Add an example policy that you can attach to the HTTPRoute that's generated by KServe (maybe tracing or a transformation?).
For the transformation it could be something like this:
```yaml
traffic:
  transformation:
    response:
      set:
      - name: x-requested-model
        value: 'string(json(request.body).model)'
      - name: x-actual-model
        value: 'string(json(response.body).model)'
```
This PR is part of kgateway-dev/kgateway.dev#606