Skip to content

fix(mcp): shut down on parent death and reliably reap downstream children#6

Open
SWSAmor wants to merge 1 commit into
mainfrom
fix/graceful-shutdown-orphan-cleanup
Open

fix(mcp): shut down on parent death and reliably reap downstream children#6
SWSAmor wants to merge 1 commit into
mainfrom
fix/graceful-shutdown-orphan-cleanup

Conversation

@SWSAmor

@SWSAmor SWSAmor commented Jun 21, 2026

Copy link
Copy Markdown
Owner

Problem

Orphaned code-executor processes and the downstream MCP children they spawn (e.g. basic-memory) survived host exit, leaving the child running and holding its SQLite lock — to the point bm doctor itself couldn't acquire the DB. Two gaps remained despite the existing stdin host-disconnect fix:

  1. Parent death was detected only via stdin EOF, which can fail to fire (host SIGKILLed, or a wrapper / an inherited fd holds the stdin pipe's write end open). There was no active parent-liveness check.
  2. disconnect() awaited client.close() with no timeout before killing children, so a wedged child (e.g. basic-memory stuck on a lock) stalled cleanup before the SIGTERM→SIGKILL loop ever ran.

Changes

  • Active parent-liveness poll — new src/mcp/parent-watcher.ts: polls process.ppid and triggers graceful shutdown when the process is reparented to init (POSIX). Complements the stdin watcher; no-op on Windows and when launched directly by init. Wired into start(), plus a new SIGHUP handler.

  • Reliable child reaping — new src/mcp/process-killer.ts killProcessGracefully(): SIGTERM → poll-for-exit → SIGKILL after a bounded grace window. A well-behaved child (dies on SIGTERM in <1s) is not delayed; only a wedged child waits out the window.

  • Bounded client.close() in client-pool.ts disconnect() so a wedged child can no longer stall the kill sequence (the root cause of surviving orphans).

  • Overall shutdown cap is now derived (drain + child-grace + buffer) so the graceful window isn't pre-empted by the old flat 35s race; the synchronous killChildrenSync() exit-handler remains the final backstop.

  • New, bounded config knobs (clamp + default rather than throw, since they're read on the shutdown path):

    Env var Default Range
    CODE_EXECUTOR_CHILD_SHUTDOWN_TIMEOUT_MS 30 000 1 000–300 000 (5 min ceiling)
    CODE_EXECUTOR_PARENT_POLL_INTERVAL_MS 2 000 500–60 000
    CODE_EXECUTOR_CLIENT_CLOSE_TIMEOUT_MS 2 000 250–30 000

Tests (TDD — tests written first)

  • Unit: parent-watcher (6), process-killer (4), shutdown-config (16) — all green.
  • Integration against the real compiled binary (tests/integration/orphan-cleanup.test.ts, opt-in — skips when the binary isn't built):
    1. stdin close (host disconnect) → downstream child reaped, binary exits 0.
    2. parent process killed → ppid poll path (stdin deliberately kept open via sleep, so only the parent-watcher can fire) → binary + child reaped.
      Runs with an isolated $HOME, so it can never connect to real MCP servers.

Verification

  • npm test: no regressions — the failing files are the pre-existing environment-dependent baseline (deno/python/network/scheduler/legacy imports), identical to main. The 26 new unit tests + 2 integration tests pass.
  • npm run typecheck clean; npm run lint clean for the touched/new files.
  • npm run build:binary succeeds; isolated stdin-EOF smoke test exits 0.

Note: running hosts pick up the rebuilt binary only after a restart.

🤖 Generated with Claude Code

…dren

Orphaned code-executor processes and the downstream MCP children they
spawned (e.g. basic-memory) survived host exit, leaving the child holding
its lock. Two gaps caused this despite the existing host-disconnect fix:

1. Parent death was detected only via stdin EOF, which can fail to fire
   (host SIGKILLed, or a wrapper/inherited fd holds the stdin pipe's write
   end open). Add an active process.ppid poll (parent-watcher) that
   triggers graceful shutdown when the process is reparented to init on
   POSIX, plus a SIGHUP handler.

2. disconnect() awaited client.close() with no timeout BEFORE killing
   children, so a wedged child stalled cleanup before the kill ran. Bound
   client.close() (CODE_EXECUTOR_CLIENT_CLOSE_TIMEOUT_MS) and replace the
   flat 2s sleep + single check with SIGTERM -> poll-for-exit -> SIGKILL
   after a configurable grace window (CODE_EXECUTOR_CHILD_SHUTDOWN_TIMEOUT_MS,
   default 30s, max 5min). Derive the overall shutdown cap from drain +
   grace + buffer so the graceful window is not pre-empted.

Timeout getters clamp/default instead of throwing, since they are read on
the shutdown path where a throw would abort the very cleanup.

Tests (TDD, tests first): unit for parent-watcher, process-killer, and the
config knobs; opt-in real-binary integration (orphan-cleanup) covering both
the stdin-close and the isolated ppid-poll paths. Integration test skips
when the binary is not built.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant