Skip to content

Address review feedback on #2896#2921

Merged
dgageot merged 1 commit into
docker:mainfrom
trungutt:address-oauth-review-feedback
May 29, 2026
Merged

Address review feedback on #2896#2921
dgageot merged 1 commit into
docker:mainfrom
trungutt:address-oauth-review-feedback

Conversation

@trungutt
Copy link
Copy Markdown
Contributor

@trungutt trungutt commented May 29, 2026

Why

#2896 introduced the in-process unmanaged OAuth drive-flow and the out-of-band callback route. The PR landed before review feedback (from the bot and from @dgageot) could be folded in. This PR addresses each point.

Changes

1. Docs — cagent/*docker-agent/* (per @dgageot)

docs/features/remote-mcp/index.md still referenced the legacy meta-key names (cagent/type, cagent/server_url, cagent/authorize_url, cagent/state) while the runtime emits docker-agent/*. Embedders following the docs would have looked up the wrong keys. Renamed in-place. Also renamed "legacy" → "client-driven" in the same file for the mode that does not set --mcp-oauth-redirect-uri, matching the PR-body terminology.

2. Drop the tautological state check on the deeplink-callback path (bot, MEDIUM)

On the direct-callback path, handleUnmanagedOAuthFlow was synthesising the payload consumeUnmanagedElicitationReply expects:

content = map[string]any{"code": cb.Code, "state": expectedState}

Then consumeUnmanagedElicitationReply did:

if subtle.ConstantTimeCompare([]byte(state), []byte(expectedState)) != 1 { ... }

Always true by construction — the guard was dead code on this path and gave a false sense of security: any reviewer would assume the state arriving over HTTP was re-validated here, when it was not.

Fix: extract a exchangeAuthorizationCode helper that performs only the /token exchange. The deeplink path calls it directly with cb.Code (state was already validated by the pending-OAuth registry lookup — only states the runtime generated and is currently awaiting are present, unknown ones return 404). The elicitation-reply path keeps its state check, because there state actually arrives from the client and is independent of expectedState.

3. Bound the OAuth flow's wait with a deadline (bot, MEDIUM)

handleUnmanagedOAuthFlow selects across:

Case When it fires (without this PR)
ctx.Done() Never — ctx is the WithoutCancel-detached ctx from clientConnector.Connect
userCancelCh Only on explicit parent-ctx cancellation by the host
callbackCh Only when a deeplink hits /api/mcp-oauth/callback
elicCh Only when requestElicitation returns

If the MCP client disconnects silently (TCP RST, idle timeout, process kill) and requestElicitation does not return promptly, the OAuth flow blocks forever. The per-session streaming lock at the SessionManager level is held until the next process restart; every subsequent message from that session returns 409 Conflict / ErrSessionBusy.

Fix: scope the elicitation goroutine to an elicCtx with a 10-minute deadline (unmanagedOAuthWaitTimeout). When the deadline fires, requestElicitation returns a wrapped context.DeadlineExceeded, the elicCh case picks it up, the OAuth flow returns, and the lock releases. An explicit <-elicCtx.Done() case in the outer select serves as a defensive fallback in case requestElicitation ever ignores its ctx. New test TestUnmanagedOAuthFlow_DriveFlow_TimesOutWhenNoReplyArrives exercises this with a 200ms override.

10 minutes is generous enough for a user to click through an IdP consent screen and any IdP-side prompts (MFA, recovery codes, …); small enough that a dead session can't strand the lock. Open to making it configurable via RuntimeConfig if needed.

4. Document the threat model on POST /api/mcp-oauth/callback (bot, MEDIUM)

The route inherits the same auth posture as the rest of the API: when --auth-token is unset, it's open. The registry's lookup-by-state IS meaningful authorization for the legitimate request (only states the runtime generated and is awaiting are accepted), but state values do appear in transit:

  • in the elicitation Meta on the session SSE stream (visible to the connected client)
  • in the authorize URL the user opens (visible to the user's browser and the authorization server)
  • in debug logs (when --debug is on)

An attacker who observes a live state via any of these channels could POST here with an attacker-controlled code; the resulting token would be bound to the attacker's account. Setting --auth-token blocks this regardless of state leakage.

The handler's doc-comment now spells out the threat model and recommends --auth-token for deployments that listen on a network-reachable interface. No code change was made: this route is no more exposed than any other API endpoint under the same group, and operators already make the --auth-token choice once for the whole API.

- Fix stale docs that still reference cagent/* meta keys; rename to
  docker-agent/* to match what the runtime emits. Also rename 'legacy'
  to 'client-driven' in docs for the mode that does not set
  --mcp-oauth-redirect-uri.

- Drop the tautological state check on the deeplink-callback path in
  handleUnmanagedOAuthFlow. Synthesizing {code, state=expectedState}
  for consumeUnmanagedElicitationReply meant the ConstantTimeCompare
  guard was always true and could mislead future readers. The
  pending-OAuth registry lookup IS the state validation for that
  path; state validation now lives only in the elicitation-reply
  branch where state actually arrives from the client. Extract a
  small exchangeAuthorizationCode helper that both paths share for
  the actual /token exchange.

- Bound the OAuth flow's wait with a 10 minute deadline. Without it,
  a silent MCP-client disconnect (TCP RST, idle timeout, process
  kill) leaves requestElicitation blocked forever and the per-session
  streaming lock held indefinitely, returning 409 / ErrSessionBusy on
  every subsequent message until a process restart. The deadline
  applies via context.WithTimeout on elicCtx; an explicit case in
  the outer select also fires as a defensive fallback if
  requestElicitation ever ignores its ctx. New test
  TestUnmanagedOAuthFlow_DriveFlow_TimesOutWhenNoReplyArrives
  exercises the path with a 200ms override.

- Document the threat model on /api/mcp-oauth/callback in the
  handler doc-comment: the registry is the per-request authorization,
  but if a state value leaks (debug logs, compromised host) an
  attacker could submit a code; --auth-token blocks this regardless
  of state leakage and SHOULD be set when the agent listens on a
  network-reachable interface.

Signed-off-by: Trung Nguyen <trung.nguyen@docker.com>
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

Reviewed the OAuth refactoring changes in this PR. The diff is clean — no bugs found in the added code.

What was checked:

  • exchangeAuthorizationCode extraction: Correctly splits the token-exchange step from state validation. The elicitation path retains its subtle.ConstantTimeCompare state check; the deeplink-callback path legitimately skips it (state was already validated by the pending-OAuth registry lookup). No tautology remains, and no validation gap is introduced.

  • Timeout / deadline logic: context.WithTimeout(ctx, unmanagedOAuthWaitTimeout) correctly bounds the wait. The defer elicCancel() ensures the goroutine's context is cancelled on all return paths. The elicitation goroutine writes to a buffered channel (capacity 1) so it cannot block after handleUnmanagedOAuthFlow returns — no goroutine leak.

  • Defensive <-elicCtx.Done() select case: Provides a correct safety net for implementations of requestElicitation that do not honour context cancellation. In practice the elicCh case fires first with a wrapped DeadlineExceeded; the defensive case only activates if the implementation ignores its ctx.

  • nil token after select: Both callbackCh and elicCh cases only leave token == nil when consumeErr != nil, so the early return on consumeErr != nil is always taken before token would be dereferenced.

  • resourceMetadata removal: The param was previously unused (_ = resourceMetadata). The resourceIndicator string (the actual token-exchange parameter derived from it) is still threaded through correctly to exchangeAuthorizationCode.

  • Test: TestUnmanagedOAuthFlow_DriveFlow_TimesOutWhenNoReplyArrives correctly exercises the deadline path. The global unmanagedOAuthWaitTimeout override is safe because the test does not call t.Parallel().

  • Docs: cagent/*docker-agent/* renames are consistent throughout the updated table and prose.

@aheritier aheritier added area/api For features/issues/fixes related to the usage of the cagent API area/mcp MCP protocol, MCP tool servers, integration area/security Authentication, authorization, secrets, vulnerabilities kind/fix PR fixes a bug (maps to fix: commit prefix) labels May 29, 2026
@trungutt trungutt marked this pull request as ready for review May 29, 2026 08:09
@trungutt trungutt requested a review from a team as a code owner May 29, 2026 08:09
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

The refactor is structurally sound. The exchangeAuthorizationCode extraction is correct — the elicitation path retains its ConstantTimeCompare state validation, while the callback path legitimately skips it because the pending-OAuth registry lookup (unknown states return 404) is the equivalent check. The buffered elicCh prevents goroutine leaks in all select arms. The elicCtx.Done() timeout guard correctly races with elicCh and both outcomes produce well-handled errors. No high or medium severity issues found — two minor defensive notes left inline.

Comment thread pkg/tools/mcp/oauth.go
)
if err != nil {
return err
if consumeErr != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Defensive nil-token guard missing after exchangeAuthorizationCode

After both select arms (callbackCh and elicCh) complete, the post-select block only checks consumeErr != nil before dereferencing token (e.g. token.ClientID = clientID a few lines below). If exchangeAuthorizationCode were ever to return (nil, nil) — a theoretical edge case if exchangeCodeForToken's contract ever changes — execution would reach token.ClientID = ... with a nil pointer and panic.

A one-line defensive check after the error guard would eliminate this latent risk:

if consumeErr != nil {
    return consumeErr
}
if token == nil {
    return errors.New("internal: exchangeAuthorizationCode returned nil token without error")
}

This is low risk given exchangeCodeForToken's current implementation always returns an error on nil token, but the guard is cheap and makes the invariant explicit.

// would return 409 / ErrSessionBusy until a process restart.
func TestUnmanagedOAuthFlow_DriveFlow_TimesOutWhenNoReplyArrives(t *testing.T) {
original := unmanagedOAuthWaitTimeout
unmanagedOAuthWaitTimeout = 200 * time.Millisecond
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Package-level test variable mutation is fragile if tests are ever parallelized

unmanagedOAuthWaitTimeout is a package-level var overridden directly in the test body:

unmanagedOAuthWaitTimeout = 200 * time.Millisecond
t.Cleanup(func() { unmanagedOAuthWaitTimeout = original })

This is safe today because the test does not call t.Parallel(). However, if a future contributor adds t.Parallel() to this test or any sibling test that reads unmanagedOAuthWaitTimeout, a data race is introduced and the cleanup's restore ordering becomes non-deterministic.

Consider a concurrency-safe alternative — for example, an oauthWaitTimeout field on oauthTransport (defaulting to the package constant) so each test instance sets its own value without touching shared state.

@dgageot dgageot merged commit 7797298 into docker:main May 29, 2026
11 checks passed
pull Bot pushed a commit to TheTechOddBug/cagent that referenced this pull request May 29, 2026
pull Bot pushed a commit to TheTechOddBug/cagent that referenced this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api For features/issues/fixes related to the usage of the cagent API area/mcp MCP protocol, MCP tool servers, integration area/security Authentication, authorization, secrets, vulnerabilities kind/fix PR fixes a bug (maps to fix: commit prefix)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants