## Summary

On Windows (plugin v1.0.5), zombie broker + `codex app-server` process trees accumulate at scale: we found 16 brokers, 16 `codex app-server` children, and 16 shell-wrapper processes (48 processes total) that had survived for 2 days. Every broker's named pipe was still alive and accepting connections. The visible symptom is the recurring `Hook cancelled` message at startup/exit.

This is a compound failure chain. Individual links are partially covered by #288, #331, #403, #380, #410, but the ack-before-cleanup shutdown handler and the resulting child-tree leak are not reported anywhere, and it is the combination that makes brokers effectively immortal.

## Environment

- Windows 11 Pro (10.0.26200), plugin v1.0.5, Node from `C:\Program Files\nodejs`
- Brokers spawned from Claude Code Bash-tool background jobs, so their environment has `SHELL` = Git Bash (`sh.exe`)

## Observed evidence (verified)

- 16 broker processes (`app-server-broker.mjs serve`) created 2026-06-30 to 07-01 were still running on 07-02. Each had a matching `codex app-server` child, wrapped in `powershell.exe -c "codex app-server"` or `sh.exe .../npm/codex app-server` (a consequence of `shell: process.env.SHELL || true` in `app-server.mjs:190-194`).
- All 16 named pipes still connected successfully (tested via `NamedPipeClientStream`).
- Sending `broker/shutdown` manually to one zombie broker returned `{"id":1,"result":{}}` in ~250 ms, but the process was still alive 2+ seconds later (it exited some minutes afterwards).
- The SessionEnd hook, run manually against this accumulated state, took long enough to exceed the hook's own `timeout: 5` (`hooks.json`). Node cold start alone costs ~1–2 s on Windows, leaving almost no budget for shutdown + teardown work. Hence the recurring `Hook cancelled`.
- Because the hook is cancelled before teardown completes, stale `broker.json` and `cxc-*` session dirs persist, and the next SessionEnd repeats the same work: the failure is self-perpetuating.

## Root-cause chain (code analysis, v1.0.5)

1. Broker side: ack-before-cleanup with unbounded await (`app-server-broker.mjs:160-163`). `shutdown()` awaits `appClient.close()`, which (`app-server.mjs:232-265`):
   - calls `terminateProcessTree(this.proc.pid)` and swallows all errors,
   - then runs `await this.exitPromise` with no timeout.
2. The tree-kill silently fails under Git Bash SHELL (`lib/process.mjs`). When `SHELL` points at Git Bash, `taskkill /PID <pid> /T /F` runs as `bash -c "taskkill /PID ..."` and MSYS converts `/PID`, `/T`, `/F` into paths (`C:/Program Files/Git/PID`, …), so taskkill errors out (same mechanism as #331, which reports it for `cancel`). The error is swallowed, the shell-wrapped `codex app-server` child never dies, `exitPromise` never resolves, and the broker sits forever after having acked the shutdown.
3. Hook side (`session-lifecycle-hook.mjs` → `broker-lifecycle.mjs`):
   - `sendBrokerShutdown` resolves on the ack (first `data` event), so the hook believes shutdown succeeded.
   - Its backstop `teardownBrokerSession → terminateProcessTree(broker.pid)` is the same broken taskkill, also swallowed.

Net result: broker + wrapper shell + `codex app-server` all survive every session end. Multiply by every background review/task and you get the 48-process pile above.

## Suggested fixes
1. `process.mjs` / `app-server.mjs`: never use `process.env.SHELL` as the spawn shell on win32. Use `shell: true` (cmd.exe) or, better, spawn `taskkill` directly with `shell: false` (it's an absolute-path-resolvable exe; args as array need no shell). This single change fixes both kill paths and #331.
2. `app-server-broker.mjs`: perform cleanup before acking `broker/shutdown` (or ack, then `process.exit(0)` after a bounded cleanup, e.g. `Promise.race([shutdown(server), sleep(2000)])`).
3. `app-server.mjs` `close()`: bound `await this.exitPromise` (e.g. 2 s), then hard-kill.
4. `hooks/hooks.json`: `timeout: 5` for SessionEnd leaves ~3 s of real budget after Node cold start on Windows; consider 15–30 s.

## Related

- #288: `sendBrokerShutdown` has no timeout, so the SessionEnd hook can hang indefinitely (hook-side link of the same chain)
- #331: `taskkill /PID` under Git Bash (the enabler; reported there for `cancel`)
- #410: zombie job state entries (`--fresh` does not clear them)
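
For illustration, the suggested fix of spawning `taskkill` directly with no shell in between might look roughly like the sketch below. This is not the plugin's actual code: `taskkillArgs` and `killProcessTree` are hypothetical names, and the real `terminateProcessTree` in `lib/process.mjs` may be structured differently. The idea is that the arguments travel as an array with `shell: false`, so no shell (and therefore no MSYS path conversion) ever sees `/PID`, and that a failed kill is reported to the caller instead of being swallowed.

```javascript
// Hypothetical helper: build the taskkill argv as an array so that no shell
// ever parses (or MSYS-converts) the /PID, /T and /F switches.
function taskkillArgs(pid) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new TypeError(`invalid pid: ${pid}`);
  }
  return ["/PID", String(pid), "/T", "/F"];
}

// Hypothetical replacement for a shell-based tree kill on win32.
// Resolves to { ok, code, stderr } rather than hiding a failed kill.
async function killProcessTree(pid) {
  const args = taskkillArgs(pid);
  // Dynamic import keeps the sketch usable from both ESM and CommonJS files.
  const { execFile } = await import("node:child_process");
  return new Promise((resolve) => {
    execFile(
      "taskkill",
      args,
      { shell: false, windowsHide: true },
      (error, _stdout, stderr) => {
        resolve({
          ok: !error,
          // error.code is the exit code, or a string such as "ENOENT".
          code: error ? error.code : 0,
          stderr: String(stderr).trim(),
        });
      }
    );
  });
}
```

On Windows, a caller could `await killProcessTree(broker.pid)` and log or escalate when `ok` is false, which would have made the failure described above visible instead of silent.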
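
The bounded-cleanup idea in the suggested fixes (the `Promise.race([shutdown(server), sleep(2000)])` pattern) can be sketched as follows. The names `withTimeout` and `handleShutdown`, and the injected `cleanup`, `ack` and `exit` callbacks, are assumptions made for this example and do not come from the plugin source; the 2000 ms default mirrors the value proposed in this issue. The sketch runs cleanup first, gives up after the deadline, and only then acks and exits, so the process cannot outlive its own acknowledgement.

```javascript
// Hypothetical helper: settle with the promise's value, or with `fallback`
// once `ms` milliseconds have elapsed, whichever happens first.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer so a fast cleanup does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical shutdown handler: bounded cleanup, then ack, then exit.
async function handleShutdown({ cleanup, ack, exit, timeoutMs = 2000 }) {
  const outcome = await withTimeout(
    // Map success and failure to labels so a cleanup error cannot reject here.
    Promise.resolve().then(cleanup).then(() => "clean", () => "failed"),
    timeoutMs,
    "timed-out"
  );
  ack(outcome);
  exit(outcome === "clean" ? 0 : 1);
  return outcome;
}
```

In the broker, `cleanup` would presumably wrap the existing `shutdown(server)` call and `exit` would be `process.exit`; the same `withTimeout` wrapper could bound `await this.exitPromise` in `close()` before falling back to a hard kill.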