Fix async branch fanout: offload payload out of AS args + fail loud on enqueue failure by chubes4 · Pull Request #396 · Automattic/agents-api

chubes4 · 2026-07-02T12:43:43Z

The bugs

The Action Scheduler branch executor's dispatch() packed each branch's full self-contained descriptor — including the shared immutable context (design intent, spec, brief) — inline into the AS action args, and duplicated that context into every branch. Action Scheduler enforces a hard 8,000-char limit on its args column, so on any realistically-sized workflow the per-branch payload blew past the limit and every enqueue threw. A real 6-page Site Forge fanout measured:

shared context per branch: 5,910 bytes
full branch descriptor per branch: 6,219 bytes
per-branch AS args payload: ~13,034 bytes → ~5KB over the 8,000 cap → all 6 enqueues threw.
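
The scaling problem behind these numbers can be illustrated with a small size check. This is a minimal sketch, assuming the cap applies to the length of the JSON-encoded args; the helper `args_length()`, the descriptor shape, and the 9,000-byte synthetic context are hypothetical and do not reproduce the measured Site Forge payload.

```php
<?php
// Illustrative only: synthetic payloads, not the measured Site Forge data.
const AS_MAX_ARGS_LENGTH = 8000; // the 8,000-char args cap described above

/** Length of the JSON-encoded args (assumed to be what the cap is applied to). */
function args_length(array $args): int {
    return strlen((string) json_encode($args));
}

$shared_context = str_repeat('x', 9000); // synthetic context, larger than the cap

// Old shape: the shared context travels inline with every branch.
$inline_args = [
    'run_id'     => 'run-1',
    'handle_id'  => 'branch-0',
    'descriptor' => ['steps' => ['render_page'], 'context' => $shared_context],
];

// New shape: only small, stable references travel in the args.
$reference_args = [
    'run_id'      => 'run-1',
    'handle_id'   => 'branch-0',
    'store_ref'   => 'run-1:branch:branch-0',
    'context_ref' => 'run-1:context',
];

$inline_length    = args_length($inline_args);
$reference_length = args_length($reference_args);

printf("inline args: %d chars, over cap: %s\n", $inline_length, $inline_length > AS_MAX_ARGS_LENGTH ? 'yes' : 'no');
printf("reference args: %d chars, over cap: %s\n", $reference_length, $reference_length > AS_MAX_ARGS_LENGTH ? 'yes' : 'no');
```

The inline length grows with the context, while the reference length depends only on the identifiers.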

Bug 1 — oversized inline branch payload. The inline descriptor scaled with context richness and was copied N times, breaking the async fanout on any non-trivial workflow.

Bug 2 — silent stuck-suspend. dispatch() called as_enqueue_async_action() with no size guard and no error handling. When the enqueue threw, AS's queue runner caught+logged+swallowed it, so dispatch() returned a phantom ref => 0 handle for a branch that was never enqueued. The run then suspended against a non-existent branch and hung draining an empty queue until its budget expired (observed: a 1800s stuck-suspend). The failure never surfaced to the caller.

The fixes

Store-offload (Bug 1). New table-free, option-backed WP_Agent_Workflow_Branch_Store (same no-new-tables discipline as the reconcile lock and the metadata._suspension frame). dispatch() persists each descriptor to the store and enqueues AS args that carry only a small, stable reference — { run_id, handle_id, store_ref, context_ref } — whose size does not scale with context. The shared context is stored once per run (run-scoped) rather than duplicated into every branch. The branch action rehydrates the full descriptor from the store, re-seating the run-scoped shared context, and runs exactly as before. Stored rows are released on resume (and on a failed dispatch) so no orphan option rows linger. Persistence is pluggable via wp_agent_workflow_branch_store_* filters.
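
The store-offload flow described above can be sketched as follows. This is not the code from this PR: `Branch_Store_Sketch`, `build_reference_args()`, and `rehydrate_branch()` are hypothetical names, and an in-memory array stands in for the option-backed persistence so the example runs outside WordPress.

```php
<?php
// Hypothetical in-memory stand-in for an option-backed branch store.
final class Branch_Store_Sketch {
    /** @var array<string, array> */
    private array $rows = [];

    public function put(string $ref, array $value): string {
        $this->rows[$ref] = $value;
        return $ref;
    }

    public function get(string $ref): ?array {
        return $this->rows[$ref] ?? null;
    }

    /** Drop every row that belongs to the run (context and branches). */
    public function release_run(string $run_id): void {
        foreach (array_keys($this->rows) as $ref) {
            if (strpos((string) $ref, $run_id . ':') === 0) {
                unset($this->rows[$ref]);
            }
        }
    }

    public function count(): int {
        return count($this->rows);
    }
}

/** Store the context once per run, each descriptor once, and return reference-only args. */
function build_reference_args(Branch_Store_Sketch $store, string $run_id, array $context, array $branches): array {
    $context_ref = $store->put("{$run_id}:context", $context);
    $all_args    = [];
    foreach ($branches as $handle_id => $descriptor) {
        $all_args[] = [
            'run_id'      => $run_id,
            'handle_id'   => $handle_id,
            'store_ref'   => $store->put("{$run_id}:branch:{$handle_id}", $descriptor),
            'context_ref' => $context_ref,
        ];
    }
    return $all_args;
}

/** Rebuild the full descriptor from the store, re-seating the run-scoped context. */
function rehydrate_branch(Branch_Store_Sketch $store, array $args): ?array {
    $descriptor = $store->get($args['store_ref']);
    $context    = $store->get($args['context_ref']);
    if (null === $descriptor || null === $context) {
        return null; // row missing: already released or never stored
    }
    $descriptor['context'] = $context;
    return $descriptor;
}

// Usage: two branches sharing one large context.
$store    = new Branch_Store_Sketch();
$context  = ['brief' => str_repeat('x', 50000)];
$branches = ['page-a' => ['step' => 'render'], 'page-b' => ['step' => 'render']];

$all_args      = build_reference_args($store, 'run-1', $context, $branches);
$rows_stored   = $store->count();               // one context row plus one row per branch
$max_args_size = max(array_map(fn(array $a): int => strlen((string) json_encode($a)), $all_args));
$hydrated      = rehydrate_branch($store, $all_args[0]);

$store->release_run('run-1');                   // as on resume or a failed dispatch
$rows_after_release = $store->count();
```

Keying every row by a run-scoped prefix is one way to make the release step a single sweep; the actual store may organize its rows differently.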

Fail-loud dispatch (Bug 2). The enqueue seam now normalizes both a throw and a non-positive id to a failure, and dispatch() returns a descriptive WP_Error on any branch's enqueue failure (cleaning up already-stored rows first). The runner treats a WP_Error dispatch as a hard step failure, so the run fails fast instead of phantom-suspending. A partial dispatch (some branches enqueued, one failed) also fails cleanly rather than suspending against a partial branch set. This keeps failing loud for any future enqueue failure (AS down, etc.).
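
The fail-loud behavior can be sketched as a seam that collapses both failure modes into one signal. This is a sketch under stated assumptions, not the PR's implementation: `enqueue_or_null()` and `dispatch_sketch()` are hypothetical names, the enqueue function is injected as a callable, and a minimal `WP_Error` stand-in is defined only so the example runs outside WordPress.

```php
<?php
// Minimal stand-in so the sketch runs outside WordPress; the real WP_Error has a richer API.
if (!class_exists('WP_Error')) {
    class WP_Error {
        private string $code;
        private string $message;

        public function __construct(string $code = '', string $message = '') {
            $this->code    = $code;
            $this->message = $message;
        }

        public function get_error_code(): string {
            return $this->code;
        }

        public function get_error_message(): string {
            return $this->message;
        }
    }
}

/** Hypothetical seam: a positive action id on success, null on a throw or a non-positive id. */
function enqueue_or_null(callable $enqueue, array $args): ?int {
    try {
        $id = $enqueue($args);
    } catch (Throwable $e) {
        return null; // a throw is treated as an enqueue failure
    }
    return (is_int($id) && $id > 0) ? $id : null;
}

/**
 * Hypothetical dispatch loop: stops at the first failed branch, runs the
 * cleanup callback, and returns an error instead of a partial handle set.
 *
 * @return array<string, int>|WP_Error
 */
function dispatch_sketch(callable $enqueue, array $branch_args, callable $cleanup) {
    $handles = [];
    foreach ($branch_args as $handle_id => $args) {
        $id = enqueue_or_null($enqueue, $args);
        if (null === $id) {
            $cleanup(); // release rows stored so far
            return new WP_Error(
                'branch_enqueue_failed',
                sprintf('Failed to enqueue branch "%s".', $handle_id)
            );
        }
        $handles[$handle_id] = $id;
    }
    return $handles;
}

// Usage: the second branch fails, so the whole dispatch fails.
$cleaned = false;
$calls   = 0;
$flaky   = function (array $args) use (&$calls): int {
    $calls++;
    if ($calls === 2) {
        throw new InvalidArgumentException('args too long');
    }
    return $calls;
};

$result = dispatch_sketch(
    $flaky,
    ['page-a' => ['run_id' => 'run-1'], 'page-b' => ['run_id' => 'run-1']],
    function () use (&$cleaned): void {
        $cleaned = true;
    }
);
```

A caller can then treat any `WP_Error` result as a hard step failure rather than suspending.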

Verification

New tests/workflow-async-branch-payload-smoke.php drives the real executor + runner + reconcile/resume + store, shimming only Action Scheduler (with its real 8,000-char args guard):

Bug 1: a >8KB shared context (55,110 bytes) — asserts the AS args stay small (max 193 bytes), the full descriptor is retrievable from the store, and the branch rehydrates + runs end-to-end to SUCCEEDED with the correct aggregate. Fails before the fix (run FAILED with "args too long", 0 actions enqueued); passes after.
Bug 2: an enqueue failure — asserts dispatch() returns a WP_Error (no phantom ref=0) and the run FAILED fast, never SUSPENDED. Fails before (uncaught throw / phantom handle → silent hang in production); passes after.
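
A shim of the kind the smoke test relies on could look like the sketch below. It is an assumption about the shape of such a shim, not the test file's contents: `AS_Shim_Sketch` and the hook name are hypothetical, and the guard assumes the cap applies to the JSON-encoded args.

```php
<?php
// Hypothetical Action Scheduler shim that enforces the 8,000-char args guard.
final class AS_Shim_Sketch {
    public const MAX_ARGS_LENGTH = 8000;

    /** @var array<int, array{hook: string, args: array}> */
    public static array $actions = [];

    /** Record the action and return its id, or throw when the args are too long. */
    public static function enqueue(string $hook, array $args = []): int {
        if (strlen((string) json_encode($args)) > self::MAX_ARGS_LENGTH) {
            throw new InvalidArgumentException('args too long');
        }
        $id                 = count(self::$actions) + 1;
        self::$actions[$id] = ['hook' => $hook, 'args' => $args];
        return $id;
    }
}

// Usage: small reference args are accepted, an oversized inline payload is rejected.
$small_id = AS_Shim_Sketch::enqueue('branch_hook_sketch', ['run_id' => 'run-1', 'store_ref' => 'run-1:branch:page-a']);

$oversized_rejected = false;
try {
    AS_Shim_Sketch::enqueue('branch_hook_sketch', ['context' => str_repeat('x', 9000)]);
} catch (InvalidArgumentException $e) {
    $oversized_rejected = true;
}

$enqueued_count = count(AS_Shim_Sketch::$actions);
```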

All existing workflow tests stay green (workflow-as-branch, workflow-parallel-async, workflow-parallel, workflow-reconcile-race, runner/validator/lifecycle). Full composer smoke (2,973 assertions) green, composer phpstan (level max) clean, php -l clean.

Refs #390

AI assistance

AI assistance: Yes
Tool(s): Claude Code (Claude Opus 4.8)
Used for: Root-cause analysis, the store-offload + fail-loud implementation, and the regression test.

chubes4 merged commit 644a9e9 into main Jul 2, 2026
2 checks passed

chubes4 merged 1 commit into main from fix/async-branch-payload-and-loud-dispatch
