Fix async branch fanout: offload payload out of AS args + fail loud on enqueue failure#396
Merged
Merged
Conversation
…n enqueue failure
The Action Scheduler branch executor packed each branch's full self-contained
descriptor -- including the shared immutable context (design intent, spec,
brief) -- INLINE into the AS action args, and duplicated that context into every
branch. Action Scheduler enforces a hard 8,000-char limit on its args column, so
on any realistically-sized workflow the per-branch payload blew past the limit
and every enqueue threw. A real 6-page fanout measured ~13KB per-branch args
against the 8KB cap.
Two bugs, both surfaced by that real run:
Bug 1 (payload scaling). The inline descriptor scaled with context richness and
was copied N times. Fix: offload the descriptor to a new table-free,
option-backed branch store (WP_Agent_Workflow_Branch_Store, same no-new-tables
discipline as the reconcile lock and the suspension frame). The AS args now carry
only a small, stable reference -- { run_id, handle_id, store_ref, context_ref }
-- whose size does not scale with context. The shared context is stored ONCE per
run (run-scoped) rather than duplicated into every branch. The branch action
rehydrates the full descriptor from the store, re-seating the run-scoped shared
context, and runs exactly as before. Stored rows are released on resume (and on a
failed dispatch) so no orphan option rows linger. Persistence is pluggable via
the wp_agent_workflow_branch_store_* filters.
Bug 2 (silent stuck-suspend). dispatch() called as_enqueue_async_action() with
no size guard and no error handling. When the enqueue threw, AS's queue runner
caught+logged+swallowed it, so dispatch() returned a phantom ref=0 handle for a
branch that was never enqueued; the run then suspended against a non-existent
branch and hung draining an empty queue until its budget expired. Fix: the
enqueue seam now normalizes both a throw and a non-positive id to a failure, and
dispatch() returns a descriptive WP_Error on ANY branch's enqueue failure
(cleaning up already-stored rows). The runner treats a WP_Error dispatch as a
hard step failure, so the run fails fast instead of phantom-suspending. A partial
dispatch (one branch of many failed) also fails cleanly rather than suspending
against a partial branch set.
Adds workflow-async-branch-payload-smoke: a >8KB shared context proving the AS
args stay small and the branch rehydrates + runs end-to-end (fails before the fix
with "args too long"), and an enqueue-failure case proving dispatch() surfaces a
WP_Error and the run fails fast rather than silently suspending.
Refs #390
AI-assistance: Yes
Tool: Claude Code (Claude Opus 4.8)
Used-for: Root-cause analysis, the store-offload + fail-loud implementation, and the regression test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bugs
The Action Scheduler branch executor's
dispatch()packed each branch's full self-contained descriptor — including the shared immutablecontext(design intent, spec, brief) — inline into the AS actionargs, and duplicated that context into every branch. Action Scheduler enforces a hard 8,000-char limit on itsargscolumn, so on any realistically-sized workflow the per-branch payload blew past the limit and every enqueue threw. A real 6-page Site Forge fanout measured:Bug 1 — oversized inline branch payload. The inline descriptor scaled with context richness and was copied N times, breaking the async fanout on any non-trivial workflow.
Bug 2 — silent stuck-suspend.
dispatch()calledas_enqueue_async_action()with no size guard and no error handling. When the enqueue threw, AS's queue runner caught+logged+swallowed it, sodispatch()returned a phantomref => 0handle for a branch that was never enqueued. The run then suspended against a non-existent branch and hung draining an empty queue until its budget expired (observed: a 1800s stuck-suspend). The failure never surfaced to the caller.The fixes
Store-offload (Bug 1). New table-free, option-backed
WP_Agent_Workflow_Branch_Store(same no-new-tables discipline as the reconcile lock and themetadata._suspensionframe).dispatch()persists each descriptor to the store and enqueues AS args that carry only a small, stable reference —{ run_id, handle_id, store_ref, context_ref }— whose size does not scale with context. The shared context is stored once per run (run-scoped) rather than duplicated into every branch. The branch action rehydrates the full descriptor from the store, re-seating the run-scoped shared context, and runs exactly as before. Stored rows are released on resume (and on a failed dispatch) so no orphan option rows linger. Persistence is pluggable viawp_agent_workflow_branch_store_*filters.Fail-loud dispatch (Bug 2). The enqueue seam now normalizes both a throw and a non-positive id to a failure, and
dispatch()returns a descriptiveWP_Erroron any branch's enqueue failure (cleaning up already-stored rows first). The runner treats aWP_Errordispatch as a hard step failure, so the run fails fast instead of phantom-suspending. A partial dispatch (some branches enqueued, one failed) also fails cleanly rather than suspending against a partial branch set. This keeps failing loud for any future enqueue failure (AS down, etc.).Verification
New
tests/workflow-async-branch-payload-smoke.phpdrives the real executor + runner + reconcile/resume + store, shimming only Action Scheduler (with its real 8,000-char args guard):SUCCEEDEDwith the correct aggregate. Fails before the fix (runFAILEDwith "args too long", 0 actions enqueued); passes after.dispatch()returns aWP_Error(no phantomref=0) and the runFAILEDfast, neverSUSPENDED. Fails before (uncaught throw / phantom handle → silent hang in production); passes after.All existing workflow tests stay green (
workflow-as-branch,workflow-parallel-async,workflow-parallel,workflow-reconcile-race, runner/validator/lifecycle). Fullcomposer smoke(2,973 assertions) green,composer phpstan(level max) clean,php -lclean.Refs #390
AI assistance