
feat(workflow-executor): use stored OAuth credentials for MCP steps [PRD-367] #1665

Draft
hercemer42 wants to merge 8 commits into feat/prd-367-pr1-executor-oauth-credentials from feat/prd-367-pr2-executor-oauth-runtime

Conversation

@hercemer42 hercemer42 commented Jun 16, 2026

Contributor

What & why

Executor-side runtime use of stored OAuth credentials for OAuth2-protected MCP servers (PRD-367, PR2). At an oauth2 MCP step the executor looks up the stored credential by (user, server), runs a refresh-token grant against the stored token_endpoint behind an expiry-skew cache, injects the Bearer token before connecting, retries once on an upstream 401 (covering both list-tools and the tool call), and pauses the step awaiting-input with awaitingInputReason: needs-oauth-reauth when there is no usable credential or the refresh is rejected. Bearer/none steps are unchanged.
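The expiry-skew cache mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SkewedTokenCache`, its method names, and the shape returned by `refresh` are all hypothetical. A cached token is reused only while it outlives the skew window, and `forceRefresh` models the single retry after an upstream 401.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry in epoch milliseconds
}

type Refresh = () => Promise<{ accessToken: string; expiresInSec: number }>;

class SkewedTokenCache {
  private readonly entries = new Map<string, CachedToken>();

  constructor(
    private readonly skewMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  async getAccessToken(
    userId: string,
    serverId: string,
    refresh: Refresh,
    forceRefresh = false,
  ): Promise<string> {
    // One cache entry per (user, server) pair.
    const key = JSON.stringify([userId, serverId]);
    const cached = this.entries.get(key);

    // Reuse the token only if it is still valid after subtracting the skew.
    if (!forceRefresh && cached && cached.expiresAtMs - this.skewMs > this.now()) {
      return cached.accessToken;
    }

    const granted = await refresh();
    this.entries.set(key, {
      accessToken: granted.accessToken,
      expiresAtMs: this.now() + granted.expiresInSec * 1000,
    });

    return granted.accessToken;
  }
}
```

A caller that receives a 401 from the MCP server would call `getAccessToken` once more with `forceRefresh` set to `true` and give up if the second attempt also fails.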

Stacked PR / deploy ordering

  • Based on feat/prd-367-pr1-executor-oauth-credentials (PR #1619, "feat(workflow-executor): add OAuth credential store + deposit endpoint (PRD-367 PR1)"), not main: PR2 depends on PR1's credential store / encryption / deposit endpoint, which is not yet merged. Review the diff against that branch.
  • Behaviour stays dormant until the Forest server serves authType + accepts the typed awaitingInputReason (PR1.5) and the frontend ships (PR3) — deploy the orchestrator first. Safe to deploy alone: the oauth2 path only activates once authType=oauth2 is served; bearer/none is unchanged.

Notable choices (large-PR annotation)

  • Serialization (option B): in-process mutex + DB re-read + single retry on invalid_grant, not a Postgres row lock — the executor and token endpoints can sit behind a client VPN, so no DB lock is held across the refresh HTTP call. Row lock is the documented hardening path if a strict reuse-detection provider plus real cross-process contention appears.
  • Executor isolation: the executor gains only a re-loadable tool source (reloadWithFreshAuth) plus a typed OAuthReauthRequiredError -> awaiting-input mapping; all token/credential/HTTP logic stays behind RemoteToolFetcher and the new OAuth token service.
  • Shared @forestadmin/ai-proxy change (consumed by the agent too) is additive-only (new loadToolsWithFailures / loadRemoteToolsWithFailures + mcp-auth-error export) plus behaviour-preserving delegation; authType is stripped in the McpClient constructor like id.
  • In-memory executor (dev-only) raises a ConfigurationError for oauth2 steps (no credential store wired).
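The serialization choice (option B) can be illustrated with a small sketch: an in-process keyed mutex around the refresh, a re-read of the stored credential, and a single retry on invalid_grant. Every name here (`KeyedMutex`, `refreshSerialized`, the error classes, the callback signatures) is hypothetical, and the rule "same token after re-read means re-auth" is an assumption made for the example rather than something the PR states.

```typescript
// Hypothetical error types; the PR's real classes may differ.
class InvalidGrantError extends Error {}
class ReauthRequiredError extends Error {}

// Serializes async tasks per key inside one process (no cross-process lock).
class KeyedMutex {
  private readonly tails = new Map<string, Promise<void>>();

  async runExclusive<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    let release: () => void = () => {};
    const current = new Promise<void>(resolve => {
      release = resolve;
    });
    const tail = previous.then(() => current);
    this.tails.set(key, tail);

    await previous; // wait for every earlier holder of this key
    try {
      return await task();
    } finally {
      release();
      if (this.tails.get(key) === tail) this.tails.delete(key);
    }
  }
}

interface StoredCredential {
  refreshToken: string;
}

async function refreshSerialized(
  mutex: KeyedMutex,
  key: string,
  readCredential: () => Promise<StoredCredential | undefined>,
  grant: (refreshToken: string) => Promise<string>,
): Promise<string> {
  return mutex.runExclusive(key, async () => {
    const first = await readCredential();
    if (!first) throw new ReauthRequiredError("no stored credential");

    try {
      return await grant(first.refreshToken);
    } catch (error) {
      if (!(error instanceof InvalidGrantError)) throw error;

      // A peer process may have rotated the token: re-read and retry once.
      const latest = await readCredential();
      if (!latest || latest.refreshToken === first.refreshToken) {
        throw new ReauthRequiredError("refresh token rejected");
      }
      try {
        return await grant(latest.refreshToken);
      } catch (retryError) {
        if (retryError instanceof InvalidGrantError) {
          throw new ReauthRequiredError("refresh token rejected after re-read");
        }
        throw retryError;
      }
    }
  });
}
```

The mutex is released before the next waiter runs, and no database lock is held while `grant` performs its HTTP call, which is the property the description gives as the reason for avoiding a row lock.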

Tests

294 tests across the touched suites (ai-proxy 59 + workflow-executor 235); build, lint, tsc clean; >=90% line/stmt coverage on changed files.

Known limitation

A re-auth at tool-call time leaves the step at idempotencyPhase=executing, so it needs a manual retry after re-auth; the common list-tools-time re-auth (pre-executor) resumes cleanly.

Part of PRD-367.

🤖 Generated with Claude Code

Note

Add stored OAuth credential support for MCP steps in workflow executor

  • Introduces OAuthTokenService to retrieve, cache, and refresh OAuth access tokens per (user, MCP server), with mutex-serialized refresh, rotation persistence, and OAuthReauthRequiredError on unrecoverable invalid_grant.
  • Extends RemoteToolFetcher to inject Bearer tokens for oauth2-typed MCP servers and retry tool loading with a forced token refresh on auth failure.
  • Updates McpStepExecutor to catch OAuthReauthRequiredError during both tool fetching and tool invocation, pausing the step with awaiting-input / needs-oauth-reauth rather than failing it; a single reauth retry is attempted before pausing.
  • Adds loadRemoteToolsWithFailures to McpClient, AiClient, and all adapter layers so per-server auth/connection failures are classified and surfaced without aborting the entire tool load.
  • Propagates awaitingInputReason through McpStepOutcomeSchema and step-outcome-to-update-step-mapper so the reason reaches the server update request.
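The "load with failures" idea in the summary above, where one failing server does not abort the whole tool load, might look roughly like this. It is a sketch under assumptions: `loadAllWithFailures`, the `FailureKind` values, and the loader map are hypothetical and do not mirror the actual `loadRemoteToolsWithFailures` signature.

```typescript
// Hypothetical result shapes; the real ai-proxy types are not shown in this PR page.
type FailureKind = "auth" | "connection";

interface ServerFailure {
  server: string;
  kind: FailureKind;
  message: string;
}

interface LoadResult<T> {
  tools: T[];
  failures: ServerFailure[];
}

async function loadAllWithFailures<T>(
  loaders: Record<string, () => Promise<T[]>>,
  classify: (error: unknown) => FailureKind,
): Promise<LoadResult<T>> {
  const tools: T[] = [];
  const failures: ServerFailure[] = [];

  await Promise.all(
    Object.entries(loaders).map(async ([server, load]) => {
      try {
        tools.push(...(await load()));
      } catch (error) {
        // Record the failure for this server and keep loading the others.
        failures.push({
          server,
          kind: classify(error),
          message: error instanceof Error ? error.message : String(error),
        });
      }
    }),
  );

  return { tools, failures };
}
```

A caller can then inspect `failures` for an entry with `kind: "auth"` and decide to refresh the token and retry only that server.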
📊 Macroscope summarized 0e05efe. 21 files reviewed, 0 issues evaluated, 0 issues filtered, 0 comments posted

🗂️ Filtered Issues

No issues evaluated.

linear-code Bot commented Jun 16, 2026


PRD-367

PRD-624

});

if (result.refreshToken) {
  await this.persistRotatedRefreshToken(credential, result.refreshToken);

Contributor Author


Claude Opus 4.8 — [Should fix] On the invalid_grant retry path, runGrantWithRotationRetry re-reads latest and grants with it, but this write-back passes the pre-retry credential for the non-token fields (clientId / clientSecret / tokenEndpoint / scopes). refreshTokenEnc and encKeyVersion are correct (from the fresh encrypt), so this is not token/key corruption — but if a peer concurrently re-deposited new config, those fields get reverted: a narrow lost-update on the exact multi-instance path this logic guards. Thread the credential actually used for the successful grant through to the write-back.

Contributor Author


Claude Opus 4.8 — resolved (2b5e5e5). Agreed. runGrantWithRotationRetry now returns the credential whose token produced the grant, and refreshAndCache writes the rotated token back onto that (current) row — so a concurrent re-deposit is no longer partially reverted. Added a regression test asserting the write-back uses the re-read credential fields.
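The fix described in this thread can be sketched as follows: the retry helper returns the credential that actually produced the grant, and the write-back spreads that row rather than the pre-retry snapshot. The function and type names are illustrative stand-ins, not the PR's `runGrantWithRotationRetry` or `refreshAndCache`.

```typescript
// Hypothetical types; field names follow the review comment loosely.
class InvalidGrantError extends Error {}

interface Credential {
  refreshToken: string;
  clientId: string;
  tokenEndpoint: string;
}

interface GrantResult {
  accessToken: string;
  refreshToken?: string; // present only when the provider rotates the token
}

type Grant = (credential: Credential) => Promise<GrantResult>;

async function grantWithRotationRetry(
  stale: Credential,
  reread: () => Promise<Credential>,
  grant: Grant,
): Promise<{ result: GrantResult; used: Credential }> {
  try {
    return { result: await grant(stale), used: stale };
  } catch (error) {
    if (!(error instanceof InvalidGrantError)) throw error;
    const latest = await reread();
    // Report which row produced the grant so the caller can write back onto it.
    return { result: await grant(latest), used: latest };
  }
}

async function refreshAndPersist(
  stale: Credential,
  reread: () => Promise<Credential>,
  grant: Grant,
  save: (row: Credential) => Promise<void>,
): Promise<string> {
  const { result, used } = await grantWithRotationRetry(stale, reread, grant);
  if (result.refreshToken) {
    // Spread the row that was used, not `stale`, so concurrently
    // re-deposited client config is not reverted.
    await save({ ...used, refreshToken: result.refreshToken });
  }
  return result.accessToken;
}
```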

Comment thread packages/ai-proxy/src/mcp-auth-error.ts Outdated
// Classifies errors surfaced while connecting to or calling an MCP server. The MCP SDK / HTTP
// transport reports failures in several shapes (a numeric status field, or only a message string),
// so the checks walk the cause chain and inspect both structured status and the message text.
const AUTH_STATUSES = new Set([401, 403]);

Contributor Author


Claude Opus 4.8 — [Should fix] 403 is treated like 401, so a 403 on the tool call forces a refresh and (on a second 403) pauses with needs-oauth-reauth. The spec specifies retry on 401; a 403 is usually insufficient-scope, which re-consent often cannot fix, so the user re-auths back into the same 403. Confirm 403 should drive the reauth flow, or narrow the refresh/reauth trigger to 401 and surface 403 as a normal failure. (Classifying 403 as kind: auth is fine — this is only about the reauth trigger.)

Contributor Author


Claude Opus 4.8 — resolved (8d441d2). Narrowed to 401 per the spec: 403 is no longer treated as a refreshable auth error, so it surfaces as an ordinary failure instead of forcing a refresh + reauth loop. Treating a scope-403 as a re-consent opportunity is a separate, out-of-scope enhancement. isMcpAuthError / classifyMcpLoadError now match 401/unauthorized only; tests updated.
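A 401-only classifier along the lines described in this thread might be sketched as below. It walks the cause chain, checks structured status fields first, and falls back to the message text only when no structured status is present at that level. `isRefreshableAuthError`, the depth limit, and the exact traversal order are assumptions for illustration, not the actual `isMcpAuthError` implementation.

```typescript
// Field names match those listed in the review thread (code/status/statusCode).
const STATUS_FIELDS = ["code", "status", "statusCode"] as const;
const AUTH_MESSAGE = /\b401\b|unauthorized/i;

function structuredStatus(record: Record<string, unknown>): number | undefined {
  for (const field of STATUS_FIELDS) {
    const value = record[field];
    if (typeof value === "number") return value;
  }
  return undefined;
}

function isRefreshableAuthError(error: unknown, maxDepth = 5): boolean {
  let current: unknown = error;

  for (let depth = 0; depth < maxDepth; depth += 1) {
    if (typeof current !== "object" || current === null) return false;
    const record = current as Record<string, unknown>;

    const status = structuredStatus(record);
    if (status === 401) return true;

    // Message fallback: only when this level carries no structured status.
    const message = record["message"];
    if (status === undefined && typeof message === "string" && AUTH_MESSAGE.test(message)) {
      return true;
    }

    current = record["cause"];
  }

  return false;
}
```

With this shape a 403 carrying a structured status is never treated as refreshable, while an error that only says "HTTP 401" in its message still is, which is the fuzzy edge the later thread accepts.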

Comment thread packages/ai-proxy/src/mcp-auth-error.ts Outdated
// transport reports failures in several shapes (a numeric status field, or only a message string),
// so the checks walk the cause chain and inspect both structured status and the message text.
const AUTH_STATUSES = new Set([401, 403]);
const AUTH_PATTERN = /\b40[13]\b|unauthorized|forbidden/i;

Contributor Author


Claude Opus 4.8 — [Preferential] This message-text fallback can match a bare 401/403 appearing for non-auth reasons in an error body (IDs, counts, timestamps). It only runs when no structured status field is present (those are checked first), so the blast radius is small — flagging as a known fuzzy edge, not a blocker.

Contributor Author


Claude Opus 4.8 — acknowledged, keeping as-is. Deliberate last-resort fallback, used only when no structured status field is present (code/status/statusCode are checked first), so the misclassification window is narrow and low-probability. Tightening the message regex risks false negatives on real 401s that carry the status only in text. Leaving it as the documented fuzzy edge.

hercemer42 added a commit that referenced this pull request Jun 16, 2026
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hercemer42 added a commit that referenced this pull request Jun 16, 2026
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hercemer42 hercemer42 force-pushed the feat/prd-367-pr2-executor-oauth-runtime branch from 8d441d2 to 0e05efe Compare June 16, 2026 16:04
qltysh Bot commented Jun 16, 2026


10 new issues

| Tool | Category | Rule | Count |
| --- | --- | --- | --- |
| qlty | Structure | Function with many parameters (count = 4): constructor | 4 |
| qlty | Structure | Function with high complexity (count = 12): invokeWithReauthRetry | 3 |
| qlty | Structure | Function with many returns (count = 9): create | 3 |

Comment thread packages/workflow-executor/src/oauth/refresh-grant.ts
qltysh Bot commented Jun 16, 2026


Qlty


Coverage Impact

⬆️ Merging this pull request will increase total coverage on feat/prd-367-pr1-executor-oauth-credentials by 0.02%.

Modified Files with Diff Coverage (20)

| Rating | File | % Diff | Uncovered Line #s |
| --- | --- | --- | --- |
| A | packages/workflow-executor/src/runner.ts | 100.0% | |
| A | packages/workflow-executor/src/executors/step-executor-factory.ts | 100.0% | |
| A | packages/workflow-executor/src/build-workflow-executor.ts | 100.0% | |
| A | packages/workflow-executor/src/remote-tool-fetcher.ts | 100.0% | |
| A | packages/workflow-executor/src/http/executor-http-server.ts | 94.1% | 332 |
| B | packages/workflow-executor/src/adapters/server-ai-adapter.ts | 100.0% | |
| A | packages/workflow-executor/src/types/validated/step-outcome.ts | 100.0% | |
| A | packages/ai-proxy/src/mcp-client.ts | 100.0% | |
| C | packages/workflow-executor/src/adapters/ai-client-adapter.ts | 100.0% | |
| A | packages/ai-proxy/src/ai-client.ts | 100.0% | |
| A | packages/workflow-executor/src/errors.ts | 100.0% | |
| A | packages/workflow-executor/src/executors/mcp-step-executor.ts | 100.0% | |
| A | ...ow-executor/src/adapters/step-outcome-to-update-step-mapper.ts | 100.0% | |
| A | ...s/workflow-executor/src/adapters/always-error-ai-model-port.ts | 100.0% | |
| A | packages/workflow-executor/src/defaults.ts | 100.0% | |
| A | packages/ai-proxy/src/index.ts | 100.0% | |
| A (new) | packages/workflow-executor/src/oauth/refresh-grant.ts | 100.0% | |
| A (new) | packages/workflow-executor/src/oauth/token-service.ts | 98.2% | 147 |
| A (new) | packages/ai-proxy/src/mcp-auth-error.ts | 92.0% | 25-27 |
| A (new) | packages/workflow-executor/src/oauth/keyed-mutex.ts | 100.0% | |
| | Total | 98.3% | |

Comment thread packages/workflow-executor/src/oauth/token-service.ts
hercemer42 added a commit that referenced this pull request Jun 17, 2026
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hercemer42 added a commit that referenced this pull request Jun 17, 2026
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hercemer42 hercemer42 force-pushed the feat/prd-367-pr2-executor-oauth-runtime branch from d4555ee to cf71d69 Compare June 17, 2026 07:41
hercemer42 added a commit that referenced this pull request Jun 18, 2026
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hercemer42 added a commit that referenced this pull request Jun 18, 2026
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hercemer42 hercemer42 force-pushed the feat/prd-367-pr2-executor-oauth-runtime branch from cf71d69 to 55309c7 Compare June 18, 2026 13:28
@hercemer42 hercemer42 force-pushed the feat/prd-367-pr1-executor-oauth-credentials branch from d42828e to 30b063c Compare June 23, 2026 09:49
hercemer42 and others added 7 commits June 23, 2026 15:56
At an oauth2 MCP step the executor looks up the stored credential by (user, server), runs the refresh-token grant against the stored token endpoint behind an expiry-skew cache, injects the bearer token before connecting, retries once on a 401 across list-tools and the tool call, and pauses for re-authentication when no usable credential exists or the refresh is rejected. Bearer and none steps are unchanged. Adds additive auth-error classification to the shared ai-proxy McpClient consumed by this path. Behaviour stays dormant until the orchestrator serves authType and the frontend ships (deploy orchestrator first), so it is safe to deploy alone. Depends on the PR1 credential store.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cutor [PRD-367]

PR1 wired an in-memory credential store + deposit endpoint into buildInMemoryExecutor, so the previous "in-memory raises ConfigurationError for oauth2 steps" behavior was inconsistent: a credential could be deposited but never used. Wire an OAuthTokenService into the in-memory runner (sharing the same store instance the deposit endpoint writes to) so oauth2 steps work end-to-end in dev, matching the database executor.

The token service is now a required RunnerConfig/RemoteToolFetcher collaborator (both executors provide it), so the unreachable ConfigurationError guard and its fetcher test are removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tials port [PRD-367]

The PR1 rebase moved the credentials store interface + types from stores/mcp-oauth-credentials-store to ports/mcp-oauth-credentials-store (the store file now holds only the Database/InMemory implementations). Import McpOAuthCredentialsStore and StoredMcpOAuthCredential from the new port path so the package compiles against the rebased PR1 base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion [PRD-367]

PR1 dropped enc_key_version from the credential store (no version-aware decrypt path), so the rotation write-back no longer carries encKeyVersion. Per PRD-367 key-rotation handling, a decrypt failure with the encryption key PRESENT (auth-tag mismatch from a since-rotated/hard-swapped key) is recoverable: toGrantParams now classifies it as OAuthReauthRequiredError (needs-oauth-reauth) so re-consent re-deposits under the new key. A missing key (ExecutorEncryptionKeyMissingError) stays terminal — re-consent cannot help and a re-deposit would 503.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting [PRD-367]

New GET /list-mcp-tools?mcpServerId= route on the executor HTTP server for the orchestrator-engine MCP-server details page: resolves the caller's vault credential (user_id from the validated JWT, never the request), refreshes, injects the Bearer, and returns the server's tool definitions — reusing RemoteToolFetcher, no new fetch/refresh logic. A missing/unrefreshable credential returns a typed needs-oauth-reauth (409), not a generic error or empty list. Wired into both the database and in-memory executors so oauth2 tool listing works in dev too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
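The route behaviour in the commit message above can be sketched as a framework-free handler. This is an illustration under assumptions: `listMcpTools`, `fetchTools`, and the 400 body are hypothetical, while the 200 and 409 shapes follow what the PR conversation describes. The user id comes from the validated JWT argument, never from the query.

```typescript
// Hypothetical error and types; the real route reuses RemoteToolFetcher.
class ReauthRequiredError extends Error {}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema?: unknown; // dropped from the response
}

type RouteResponse =
  | { status: 200; body: { name: string; description: string }[] }
  | { status: 400; body: { error: string } }
  | { status: 409; body: { awaitingInputReason: "needs-oauth-reauth"; mcpServerId: string } };

async function listMcpTools(
  jwtUserId: string,
  query: { mcpServerId?: string },
  fetchTools: (userId: string, mcpServerId: string) => Promise<ToolDefinition[]>,
): Promise<RouteResponse> {
  const { mcpServerId } = query;
  if (!mcpServerId) {
    return { status: 400, body: { error: "mcpServerId is required" } };
  }

  try {
    const tools = await fetchTools(jwtUserId, mcpServerId);
    // Return only non-secret fields for each tool.
    return { status: 200, body: tools.map(({ name, description }) => ({ name, description })) };
  } catch (error) {
    if (error instanceof ReauthRequiredError) {
      // Typed conflict instead of a generic error or an empty list.
      return { status: 409, body: { awaitingInputReason: "needs-oauth-reauth", mcpServerId } };
    }
    throw error;
  }
}
```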
@hercemer42 hercemer42 force-pushed the feat/prd-367-pr2-executor-oauth-runtime branch from 55309c7 to 842d468 Compare June 23, 2026 15:09
@hercemer42

Contributor Author

Agent code review — Claude Opus 4.8 (2026-06-18)

Reviewed the folded-in commits 7bf9357b (key-rotation re-consent) and 842d4680 (GET /list-mcp-tools) via the pr-review-toolkit code-reviewer; the runtime/max-run path was covered in the earlier review pass. No issues found.

Verified:

  • list-mcp-tools: user_id is taken from the validated JWT (never the request); the mcpServerId query is validated (400 if absent). Responses leak no secrets (200 → {name, description} per tool; 409 → {awaitingInputReason, mcpServerId}). The typed needs-oauth-reauth 409 is set on ctx directly so it survives the error middleware (which would otherwise remap OAuthReauthRequiredError → 400).
  • Key-rotation classification (toGrantParams): the try wraps only the decrypt, so a missing key stays terminal (ExecutorEncryptionKeyMissingError rethrown) and a key-present decrypt failure → needs-oauth-reauth; the grant call sits outside the try, so invalid_grant/transient/5xx errors cannot be misclassified into a re-consent loop.
  • Domain error classes (no raw Error), no PRD refs in comments, tool serialization type-safe.
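The key-rotation classification verified above can be sketched as follows, with the try wrapping only the decrypt step. The names `decryptOrClassify` and `toGrantParamsSketch`, the error classes, and the injected `decrypt` callback are hypothetical stand-ins for illustration.

```typescript
// Hypothetical error classes standing in for the PR's domain errors.
class EncryptionKeyMissingError extends Error {}
class ReauthRequiredError extends Error {}

interface EncryptedCredential {
  refreshTokenEnc: string;
  tokenEndpoint: string;
}

function decryptOrClassify(
  ciphertext: string,
  decrypt: (ciphertext: string) => string,
): string {
  try {
    return decrypt(ciphertext);
  } catch (error) {
    // A missing key is terminal: re-consent cannot help.
    if (error instanceof EncryptionKeyMissingError) throw error;
    // Key present but decrypt failed (for example after a key swap):
    // recoverable by re-consent, which re-deposits under the new key.
    throw new ReauthRequiredError("stored token cannot be decrypted with the current key");
  }
}

function toGrantParamsSketch(
  credential: EncryptedCredential,
  decrypt: (ciphertext: string) => string,
): { refreshToken: string; tokenEndpoint: string } {
  // Only the decrypt is wrapped; the grant call happens elsewhere, so
  // grant errors cannot be turned into a re-auth request by this function.
  const refreshToken = decryptOrClassify(credential.refreshTokenEnc, decrypt);
  return { refreshToken, tokenEndpoint: credential.tokenEndpoint };
}
```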

…nses [PRD-367]

A literal JSON null (or other non-object) body from the token endpoint overwrote the {} parse default, so the subsequent payload.error / payload.access_token reads threw a TypeError instead of the typed OAuthRefreshError. Keep the {} default for non-object bodies so the status checks still surface a typed OAuthRefreshError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
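The parsing fix in the commit message above can be sketched like this: the `{}` default is kept whenever the body is not a JSON object, so later reads of `payload.error` or `payload.access_token` cannot throw a TypeError. `parseTokenBody` and `TokenPayload` are hypothetical names, and treating arrays as non-objects is an assumption made for this example.

```typescript
// Hypothetical payload shape; only the two fields named in the commit message.
interface TokenPayload {
  access_token?: unknown;
  error?: unknown;
}

function parseTokenBody(text: string): TokenPayload {
  try {
    const parsed: unknown = JSON.parse(text);
    // JSON null, numbers, strings and (by assumption) arrays keep the {} default.
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return parsed as TokenPayload;
    }
  } catch {
    // Not JSON at all: fall through to the default.
  }
  return {};
}
```

Because the result is always an object, the caller's status checks run as usual and can raise the typed refresh error instead of crashing on a property read.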