Windows: zombie broker + codex app-server trees accumulate — broker/shutdown acks before unbounded cleanup, and taskkill tree-kill breaks under Git Bash SHELL #416

@JethroTseng

Summary

On Windows (plugin v1.0.5), zombie broker + codex app-server process trees accumulate at scale — we found 16 brokers, 16 codex app-server children, and 16 shell-wrapper processes (48 processes total) that had survived for 2 days. Every broker's named pipe was still alive and accepting connections. The visible symptom is the recurring startup/exit message:

SessionEnd hook [node "${CLAUDE_PLUGIN_ROOT}/scripts/session-lifecycle-hook.mjs" SessionEnd] failed: Hook cancelled

This is a compound failure chain. Individual links are partially covered by #288, #331, #403, #380, #410, but the ack-before-cleanup shutdown handler and the resulting child-tree leak are not reported anywhere, and it's the combination that makes brokers effectively immortal.

Environment

  • Windows 11 Pro (10.0.26200), plugin v1.0.5, Node from C:\Program Files\nodejs
  • Brokers spawned from Claude Code Bash-tool background jobs, so their environment has SHELL = Git Bash (sh.exe)

Observed evidence (verified)

  1. 16 broker processes (app-server-broker.mjs serve) created 2026-06-30 → 07-01 were still running on 07-02. Each had a matching codex app-server child — wrapped in powershell.exe -c "codex app-server" or sh.exe .../npm/codex app-server (a consequence of shell: process.env.SHELL || true in app-server.mjs:190-194).
  2. All 16 named pipes still connected successfully (tested via NamedPipeClientStream).
  3. Sending broker/shutdown manually to one zombie broker returned {"id":1,"result":{}} in ~250 ms, but the process was still alive 2+ seconds later (it exited some minutes afterwards).
  4. The SessionEnd hook, run manually against this accumulated state, ran longer than the hook's own timeout (timeout: 5 in hooks.json). Node cold start alone costs ~1–2 s on Windows, which leaves only about 3 s for shutdown + teardown work. Hence the recurring Hook cancelled.
  5. Because the hook is cancelled before teardown completes, stale broker.json and cxc-* session dirs persist, and the next SessionEnd repeats the same work — the failure is self-perpetuating.

Root-cause chain (code analysis, v1.0.5)

Broker side — ack-before-cleanup with unbounded await (app-server-broker.mjs:160-163):

if (message.id !== undefined && message.method === "broker/shutdown") {
  send(socket, { id: message.id, result: {} });   // 1. ack FIRST
  await shutdown(server);                          // 2. then cleanup — unbounded
  process.exit(0);
}

shutdown() awaits appClient.close(), which (app-server.mjs:232-265):

  • ends the child's stdin,
  • after 50 ms calls terminateProcessTree(this.proc.pid) and swallows all errors,
  • then does await this.exitPromise with no timeout.
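
A minimal sketch of how the final wait could be bounded. The helper name, the `hardKill` callback and the 2 s default are hypothetical and not taken from app-server.mjs; `exitPromise` stands in for `this.exitPromise`.

```javascript
// Hypothetical helper: wait for the child's exit, but never longer than timeoutMs.
// `exitPromise` stands in for this.exitPromise; `hardKill` is any forceful kill.
async function waitForExitOrKill(exitPromise, hardKill, timeoutMs = 2000) {
  let timer;
  const timedOut = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    const outcome = await Promise.race([
      exitPromise.then(() => "exited"),
      timedOut,
    ]);
    if (outcome === "timeout") {
      // The graceful path did not finish in time: escalate instead of waiting forever.
      await hardKill();
    }
    return outcome;
  } finally {
    // Clear the timer so it cannot keep the event loop alive after a clean exit.
    clearTimeout(timer);
  }
}
```

Returning the outcome rather than swallowing it lets the caller log whether the child exited on its own or had to be killed.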

The tree-kill silently fails under Git Bash SHELL (lib/process.mjs):

shell: process.platform === "win32" ? (process.env.SHELL || true) : false,

When SHELL points at Git Bash, taskkill /PID <pid> /T /F runs as bash -c "taskkill /PID ..." and MSYS converts /PID, /T, /F into paths (C:/Program Files/Git/PID, …) — taskkill errors out (same mechanism as #331, which reports it for cancel). The error is swallowed, the shell-wrapped codex app-server child never dies, exitPromise never resolves, and the broker sits forever after having acked the shutdown.
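
For illustration, a sketch of a shell-free taskkill invocation. `taskkillInvocation` is a hypothetical helper, not a function in lib/process.mjs. Because the switches travel as an argument array with `shell: false`, no bash process sits in between to rewrite them.

```javascript
// Hypothetical helper: describe a taskkill call that bypasses any shell, so
// MSYS path conversion can never touch the /PID, /T and /F switches.
function taskkillInvocation(pid) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new TypeError(`invalid pid: ${pid}`);
  }
  return {
    command: "taskkill",
    // Each switch is its own argv entry; nothing is parsed by a shell.
    args: ["/PID", String(pid), "/T", "/F"],
    options: { shell: false, windowsHide: true, stdio: "ignore" },
  };
}
```

The result can be handed to `spawn(command, args, options)` from `node:child_process`. A caller would still want to check the exit code instead of swallowing failures.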

Hook side (session-lifecycle-hook.mjs → broker-lifecycle.mjs):

  • sendBrokerShutdown resolves on the ack (first data event), so the hook believes shutdown succeeded.
  • Its backstop teardownBrokerSession → terminateProcessTree(broker.pid) is the same broken taskkill, also swallowed.
  • Net result: broker + wrapper shell + codex app-server all survive every session end. Multiply by every background review/task and you get the 48-process pile above.
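
One way the hook could confirm that the broker is really gone, instead of trusting the ack, is to poll the pid with a deadline. This is only a sketch: `isAlive` and `waitForPidExit` are hypothetical names, and the timeout and interval values are assumptions.

```javascript
// Hypothetical helper: report whether a pid still exists.
// process.kill(pid, 0) sends no signal; it only tests for existence.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we are not allowed to signal it.
    return err.code === "EPERM";
  }
}

// Poll until the pid is gone or the deadline passes.
// Resolves true if the process exited, false if it is still alive at the deadline.
async function waitForPidExit(pid, timeoutMs = 2000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (isAlive(pid)) {
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return true;
}
```

A `false` result is the signal to escalate to a tree-kill and to report the failure rather than treat the session as cleaned up.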

Suggested fixes

  1. process.mjs / app-server.mjs: never use process.env.SHELL as the spawn shell on win32. Use shell: true (cmd.exe) or, better, spawn taskkill directly with shell: false (it's an absolute-path-resolvable exe; args as array need no shell). This single change fixes both kill paths and also #331 (codex-companion cancel taskkill fallback breaks under Git Bash: MSYS translates the /PID arg to C:/Program Files/Git/PID).
  2. app-server-broker.mjs: perform cleanup before acking broker/shutdown (or ack, then process.exit(0) after a bounded cleanup — e.g. Promise.race([shutdown(server), sleep(2000)])).
  3. app-server.mjs close(): bound await this.exitPromise (e.g. 2 s), then hard-kill.
  4. hooks/hooks.json: timeout: 5 for SessionEnd leaves ~3 s of real budget after Node cold start on Windows; consider 15–30 s.
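
As a rough sketch of fix 2, a handler that bounds cleanup and acks only afterwards. `send`, `shutdown` and `exit` are injected stand-ins for the broker's own functions; the `handleBrokerShutdown` name, the 2 s default and the non-zero exit code on timeout are assumptions, not existing behaviour.

```javascript
// Sketch of a broker/shutdown handler: bounded cleanup first, ack second.
// `send`, `shutdown` and `exit` are injected stand-ins for the broker's own functions.
async function handleBrokerShutdown(message, { send, shutdown, exit, timeoutMs = 2000 }) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  // true if cleanup finished in time, false if it failed or the deadline won.
  const clean = await Promise.race([
    shutdown().then(() => true, () => false),
    deadline,
  ]);
  clearTimeout(timer);
  // The ack is sent only now, so a client resolving on it sees a finished shutdown.
  send({ id: message.id, result: {} });
  exit(clean ? 0 : 1);
  return clean;
}
```

With this ordering the ack can take up to `timeoutMs` to arrive, so the client-side wait and the SessionEnd hook timeout would need to allow for it.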
