Skip to content

Reliable shutdown on Windows + lelab --stop#46

Open
nobullryder wants to merge 1 commit into
huggingface:mainfrom
nobullryder:launcher-reliability
Open

Reliable shutdown on Windows + lelab --stop#46
nobullryder wants to merge 1 commit into
huggingface:mainfrom
nobullryder:launcher-reliability

Conversation

@nobullryder

Copy link
Copy Markdown
Contributor

Problem

On Windows, stopping lelab (Ctrl-C / closing the terminal) often leaves the uvicorn and Vite child processes alive, which keeps port :8000 — and any open camera handles — held. The next lelab then fails to bind, or a camera won't open because the previous run never released it.

What this does

  • lelab --stop — finds and terminates a running LeLab and its child process tree, freeing the port.
  • Clean process-tree teardown on exit (psutil, with a Windows taskkill /T fallback) so the uvicorn/Vite children and their handles are actually released.
  • Pre-flight port checks with an actionable message ("…run lelab --stop to free it") instead of an opaque bind error, plus a readiness wait before opening the browser.

Tests

tests/test_scripts_lelab.py covers the stop / teardown / port-check paths with mocked psutil + subprocess (no real processes spawned).

🤖 Generated with Claude Code

On Windows, stopping LeLab (Ctrl-C / closing the terminal) often leaves the
uvicorn and Vite child processes alive, which keeps port :8000 — and any open
camera handles — held. The next `lelab` then fails to bind, or a camera won't
open because the previous run never released it.

- `lelab --stop`: find and terminate a running LeLab and its child process tree,
  freeing the port.
- Clean process-tree teardown on exit (psutil, with a Windows `taskkill /T`
  fallback) so the uvicorn/Vite children and their handles are actually released.
- Pre-flight port checks with an actionable message ("…run `lelab --stop` to free
  it") instead of an opaque bind error, plus a readiness wait before opening the
  browser.

Tests in tests/test_scripts_lelab.py cover the stop / teardown / port-check paths
with mocked psutil + subprocess (no real processes spawned).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread lelab/scripts/lelab.py
# asyncio Proactor loop doesn't wind down cleanly), leaving the terminal
# stuck. Take over signal handling: stop hard and reap any child
# subprocesses (training/recording/inference) so the prompt always returns.
server.install_signal_handlers = lambda: None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hard-exit path is gated by intent (the comment is about the Windows Proactor loop hanging) but not by code — install_signal_handlers = lambda: None and the os._exit(0) below run on every platform. On macOS/Linux this drops the graceful shutdown the old prod code explicitly opted into (timeout_graceful_shutdown=2) and skips the @app.on_event("shutdown") cleanup. Can we gate this to Windows (e.g. if os.name == "nt": around the override + hard exit) and keep uvicorn's native graceful handler elsewhere?

Comment thread lelab/scripts/lelab.py
if returncode is not None:
logger.error("%s stopped with exit code %s.", name, returncode)
_shutdown_processes(processes)
raise SystemExit(returncode or 1)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returncode or 1 turns a clean child exit (returncode == 0) into exit code 1. Suggest raise SystemExit(returncode if returncode is not None else 1).

Comment thread lelab/scripts/lelab.py
def _wait_for_port(port: int, timeout: int = 30) -> bool:
for _ in range(timeout):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
def _fail(message: str) -> None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you annotate this -> typing.NoReturn, the type-checker knows control never returns past _fail, which lets you delete the unreachable raise AssertionError("unreachable") lines at 151 and 246 (they're only there because the return type currently looks like None).

Comment thread lelab/scripts/lelab.py
def _install_signal_handlers(processes: Sequence[tuple[str, subprocess.Popen]]) -> None:
def shutdown(_signum, _frame) -> None:
logger.info("Shutting down LeLab...")
_shutdown_processes(processes)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this runs teardown, then the raise SystemExit(0) is caught by the except BaseException at line 430, which calls _shutdown_processes again. It's idempotent so harmless, but the handler could just raise SystemExit(0) and let the except own teardown to avoid the double pass.

@nicolas-rabault nicolas-rabault left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nobullryder for this contribution.
Here are some changes requests that could improve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants