Skip to content

Fix async branch fanout: offload payload out of AS args + fail loud on enqueue failure#396

Merged
chubes4 merged 1 commit into
mainfrom
fix/async-branch-payload-and-loud-dispatch
Jul 2, 2026
Merged

Fix async branch fanout: offload payload out of AS args + fail loud on enqueue failure#396
chubes4 merged 1 commit into
mainfrom
fix/async-branch-payload-and-loud-dispatch

Conversation

@chubes4

@chubes4 chubes4 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

The bugs

The Action Scheduler branch executor's dispatch() packed each branch's full self-contained descriptor — including the shared immutable context (design intent, spec, brief) — inline into the AS action args, and duplicated that context into every branch. Action Scheduler enforces a hard 8,000-char limit on its args column, so on any realistically-sized workflow the per-branch payload blew past the limit and every enqueue threw. A real 6-page Site Forge fanout measured:

  • shared context per branch: 5,910 bytes
  • full branch descriptor per branch: 6,219 bytes
  • per-branch AS args payload: ~13,034 bytes → ~5KB over the 8,000 cap → all 6 enqueues threw.

Bug 1 — oversized inline branch payload. The inline descriptor scaled with context richness and was copied N times, breaking the async fanout on any non-trivial workflow.

Bug 2 — silent stuck-suspend. dispatch() called as_enqueue_async_action() with no size guard and no error handling. When the enqueue threw, AS's queue runner caught+logged+swallowed it, so dispatch() returned a phantom ref => 0 handle for a branch that was never enqueued. The run then suspended against a non-existent branch and hung draining an empty queue until its budget expired (observed: a 1800s stuck-suspend). The failure never surfaced to the caller.

The fixes

Store-offload (Bug 1). New table-free, option-backed WP_Agent_Workflow_Branch_Store (same no-new-tables discipline as the reconcile lock and the metadata._suspension frame). dispatch() persists each descriptor to the store and enqueues AS args that carry only a small, stable reference — { run_id, handle_id, store_ref, context_ref } — whose size does not scale with context. The shared context is stored once per run (run-scoped) rather than duplicated into every branch. The branch action rehydrates the full descriptor from the store, re-seating the run-scoped shared context, and runs exactly as before. Stored rows are released on resume (and on a failed dispatch) so no orphan option rows linger. Persistence is pluggable via wp_agent_workflow_branch_store_* filters.

Fail-loud dispatch (Bug 2). The enqueue seam now normalizes both a throw and a non-positive id to a failure, and dispatch() returns a descriptive WP_Error on any branch's enqueue failure (cleaning up already-stored rows first). The runner treats a WP_Error dispatch as a hard step failure, so the run fails fast instead of phantom-suspending. A partial dispatch (some branches enqueued, one failed) also fails cleanly rather than suspending against a partial branch set. This keeps failing loud for any future enqueue failure (AS down, etc.).

Verification

New tests/workflow-async-branch-payload-smoke.php drives the real executor + runner + reconcile/resume + store, shimming only Action Scheduler (with its real 8,000-char args guard):

  • Bug 1: a >8KB shared context (55,110 bytes) — asserts the AS args stay small (max 193 bytes), the full descriptor is retrievable from the store, and the branch rehydrates + runs end-to-end to SUCCEEDED with the correct aggregate. Fails before the fix (run FAILED with "args too long", 0 actions enqueued); passes after.
  • Bug 2: an enqueue failure — asserts dispatch() returns a WP_Error (no phantom ref=0) and the run FAILED fast, never SUSPENDED. Fails before (uncaught throw / phantom handle → silent hang in production); passes after.

All existing workflow tests stay green (workflow-as-branch, workflow-parallel-async, workflow-parallel, workflow-reconcile-race, runner/validator/lifecycle). Full composer smoke (2,973 assertions) green, composer phpstan (level max) clean, php -l clean.

Refs #390

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Claude Opus 4.8)
  • Used for: Root-cause analysis, the store-offload + fail-loud implementation, and the regression test.

…n enqueue failure

The Action Scheduler branch executor packed each branch's full self-contained
descriptor -- including the shared immutable context (design intent, spec,
brief) -- INLINE into the AS action args, and duplicated that context into every
branch. Action Scheduler enforces a hard 8,000-char limit on its args column, so
on any realistically-sized workflow the per-branch payload blew past the limit
and every enqueue threw. A real 6-page fanout measured ~13KB per-branch args
against the 8KB cap.

Two bugs, both surfaced by that real run:

Bug 1 (payload scaling). The inline descriptor scaled with context richness and
was copied N times. Fix: offload the descriptor to a new table-free,
option-backed branch store (WP_Agent_Workflow_Branch_Store, same no-new-tables
discipline as the reconcile lock and the suspension frame). The AS args now carry
only a small, stable reference -- { run_id, handle_id, store_ref, context_ref }
-- whose size does not scale with context. The shared context is stored ONCE per
run (run-scoped) rather than duplicated into every branch. The branch action
rehydrates the full descriptor from the store, re-seating the run-scoped shared
context, and runs exactly as before. Stored rows are released on resume (and on a
failed dispatch) so no orphan option rows linger. Persistence is pluggable via
the wp_agent_workflow_branch_store_* filters.

Bug 2 (silent stuck-suspend). dispatch() called as_enqueue_async_action() with
no size guard and no error handling. When the enqueue threw, AS's queue runner
caught+logged+swallowed it, so dispatch() returned a phantom ref=0 handle for a
branch that was never enqueued; the run then suspended against a non-existent
branch and hung draining an empty queue until its budget expired. Fix: the
enqueue seam now normalizes both a throw and a non-positive id to a failure, and
dispatch() returns a descriptive WP_Error on ANY branch's enqueue failure
(cleaning up already-stored rows). The runner treats a WP_Error dispatch as a
hard step failure, so the run fails fast instead of phantom-suspending. A partial
dispatch (one branch of many failed) also fails cleanly rather than suspending
against a partial branch set.

Adds workflow-async-branch-payload-smoke: a >8KB shared context proving the AS
args stay small and the branch rehydrates + runs end-to-end (fails before the fix
with "args too long"), and an enqueue-failure case proving dispatch() surfaces a
WP_Error and the run fails fast rather than silently suspending.

Refs #390

AI-assistance: Yes
Tool: Claude Code (Claude Opus 4.8)
Used-for: Root-cause analysis, the store-offload + fail-loud implementation, and the regression test.
@chubes4 chubes4 merged commit 644a9e9 into main Jul 2, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant