فهرست منبع

feat(diagnostics): event-loop stall watchdog to catch silent backend freezes

  Several "container hangs after adding a printer" reports (#1486) share a
  signature with nothing to act on: HTTP goes silent, /health hangs, the
  process may ignore SIGTERM, and the log just stops mid-stream - a frozen
  asyncio loop cannot log a thing.

  loop_watchdog re-arms faulthandler.dump_traceback_later() from an async
  heartbeat. While the loop ticks the timer is cancelled and re-armed before
  it can fire; if the loop stalls for 30s the heartbeat can't re-arm and
  faulthandler's C-level timer thread dumps every thread's stack to stderr,
  so the blocked frame shows up in `docker compose logs`.
maziggy 6 روز پیش
والد
کامیت
774eba73c8
4فایلهای تغییر یافته به همراه159 افزوده شده و 0 حذف شده
  1. 1 0
      CHANGELOG.md
  2. 9 0
      backend/app/main.py
  3. 77 0
      backend/app/services/loop_watchdog.py
  4. 72 0
      backend/tests/unit/services/test_loop_watchdog.py

+ 1 - 0
CHANGELOG.md

@@ -5,6 +5,7 @@ All notable changes to Bambuddy will be documented in this file.
 ## [0.2.5b1] - Unreleased
 
 ### Added
+- **Event-loop stall watchdog — makes a frozen backend self-diagnose (#1486 groundwork)** — Several "container hangs after adding a printer" reports share a signature that leaves nothing to act on: the HTTP server goes silent, `/health` hangs, the process may stop responding to SIGTERM — and the logs just stop mid-stream with no traceback, because a frozen asyncio event loop cannot log anything. New `backend/app/services/loop_watchdog.py` closes that blind spot: an async heartbeat re-arms `faulthandler.dump_traceback_later()` every 10s, always 30s ahead. While the loop ticks, the timer is cancelled and re-armed before it can fire; if the loop stalls, the heartbeat can't re-arm and faulthandler's dedicated C-level timer thread — which runs independently of the frozen loop — dumps **every thread's stack to stderr**. The blocked frame then appears in `docker compose logs`, turning an un-diagnosable freeze into a one-command capture. Started in the app lifespan after migrations, stopped cleanly on shutdown; 30s threshold is well above any legitimate on-loop operation, so a trip always means a real bug. 5 unit tests in `test_loop_watchdog.py` (arms the timer, idempotent start, stop disarms + cancels, heartbeat interval below the threshold, survives a re-arm error). Backend ruff clean; full app lifespan verified via the integration suite.
 - **Slicer: process & filament profiles filtered by the selected printer (#1325, requested by @IndividualGhost1905)** — In the server-side Slice dialog, picking a printer profile now filters the Process and Filament dropdowns to presets compatible with that printer; presets that resolve to a different Bambu model drop into a trailing "Other printers" group instead of cluttering the main list. Matching uses the slicer's own `compatible_printers` list for imported (local) presets, and falls back to the `@BBL <model>` name suffix for cloud and standard presets, so all three tiers are covered. Compatibility-unknown presets (custom or untagged) are never hidden. Defaults follow suit — the pre-picked process and per-slot filament now prefer a printer-compatible preset, and switching the printer re-picks any selection left incompatible. The printer and process dropdowns also default to the preset names embedded in the source 3MF's `project_settings.config` when those presets are available, instead of always taking the first listed preset. New `frontend/src/utils/slicerPrinterMatch.ts` (11 unit tests) and `extract_embedded_presets_from_3mf` (5 unit tests); `UnifiedPreset` now carries `compatible_printers`, exposed for the local tier (`backend/app/api/routes/slicer_presets.py`); the plates endpoints return `embedded_printer` / `embedded_process`. Parity green, build clean.
 - **Spanish (es) translation (#1243, requested by @MiguelAngelLV)** — Bambuddy now ships a full European Spanish locale. New `frontend/src/i18n/locales/es.ts` translates all 4899 keys with placeholders, plural forms, and inline markup preserved; registered in `frontend/src/i18n/index.ts` and selectable as "Español" in the language picker. The parity checker auto-discovers the file — `frontend/scripts/check-i18n-parity.mjs` gained an `ES_COGNATES` allow-list for genuine Spanish cognates and brand/format tokens. Brings the supported-language count to 9 (en / de / es / fr / it / ja / pt-BR / zh-CN / zh-TW). Parity green, frontend build clean.
 - **Currency: Belize Dollars (BZD) added to the Settings → Cost currency dropdown (#1454, requested by @PLGuerraDesigns)** — Reporter accurately tracks 3D-printing filament costs in his local currency and BZD wasn't selectable, forcing a manual 2:1 mental conversion from USD. Added `BZD: 'BZ$'` to `frontend/src/utils/currency.ts` next to MXN (Americas dollar-prefix grouping); `getCurrencySymbol('BZD')` returns `'BZ$'` and the SUPPORTED_CURRENCIES list now has 30 entries. Unit test added in `frontend/src/__tests__/utils/currency.test.ts` covering the symbol lookup and presence in SUPPORTED_CURRENCIES; entry-count assertion bumped to 30 so any future additions/removals are caught immediately. 14 currency tests green; frontend build clean.

+ 9 - 0
backend/app/main.py

@@ -4933,6 +4933,12 @@ async def lifespan(app: FastAPI):
     # L-2: Start periodic auth cleanup (stale TOTP + expired revoked JTIs)
     start_auth_cleanup()
 
+    # Event-loop stall watchdog: dumps all thread stacks to stderr if the loop
+    # freezes (#1486 — silent "container hangs after adding a printer" reports).
+    from backend.app.services.loop_watchdog import start_loop_watchdog
+
+    start_loop_watchdog()
+
     # Initialize virtual printer manager and sync from DB
     from backend.app.services.virtual_printer import virtual_printer_manager
 
@@ -4960,6 +4966,9 @@ async def lifespan(app: FastAPI):
     stop_runtime_tracking()
     stop_spoolbuddy_watchdog()
     stop_camera_cleanup()
+    from backend.app.services.loop_watchdog import stop_loop_watchdog
+
+    stop_loop_watchdog()
     # Tear down all camera fan-out broadcasters (#1089) so subscribers exit
     # cleanly rather than waiting on a queue that nothing will ever fill.
     try:

+ 77 - 0
backend/app/services/loop_watchdog.py

@@ -0,0 +1,77 @@
+"""Event-loop stall watchdog (#1486).
+
+A frozen asyncio event loop is invisible: it produces no log line and no
+traceback — the HTTP server just goes silent, ``/health`` hangs, and the
+process can stop responding to SIGTERM. Several "container hangs after adding
+a printer" reports had exactly this shape, with nothing in the logs to act on.
+
+This watchdog makes such a freeze diagnosable. An async heartbeat re-arms
+``faulthandler.dump_traceback_later()`` every ``HEARTBEAT_INTERVAL`` seconds,
+always ``STALL_THRESHOLD`` seconds ahead. While the loop keeps ticking the
+timer is cancelled and re-armed before it can fire. If the loop stalls, the
+heartbeat can't re-arm — and faulthandler's timer runs in a dedicated C-level
+thread that fires regardless of the frozen loop, dumping *every* thread's
+stack to stderr. The blocked frame then shows up in ``docker compose logs``.
+"""
+
+import asyncio
+import faulthandler
+import logging
+
+logger = logging.getLogger(__name__)
+
+# How often the heartbeat cancels + re-arms the faulthandler timer. Must be
+# comfortably below STALL_THRESHOLD so a healthy loop always re-arms in time.
+HEARTBEAT_INTERVAL = 10.0
+
+# The loop must be unresponsive for at least this long before thread stacks
+# are dumped. Generous on purpose: no legitimate on-loop operation should
+# block for 30s, so anything that does is itself a bug worth a stack dump.
+STALL_THRESHOLD = 30.0
+
+_watchdog_task: asyncio.Task | None = None
+
+
+async def _heartbeat_loop() -> None:
+    """Re-arm the faulthandler stall timer on every tick."""
+    while True:
+        try:
+            faulthandler.cancel_dump_traceback_later()
+            # repeat=False: one dump pinpoints a hard freeze. If the loop
+            # recovers and stalls again, the next heartbeat re-arms anyway.
+            faulthandler.dump_traceback_later(STALL_THRESHOLD, repeat=False)
+        except Exception as e:  # never let the watchdog itself crash the app
+            logger.warning("Loop watchdog re-arm failed: %s", e)
+        try:
+            await asyncio.sleep(HEARTBEAT_INTERVAL)
+        except asyncio.CancelledError:
+            break
+
+
+def start_loop_watchdog() -> None:
+    """Start the event-loop stall watchdog. Idempotent."""
+    global _watchdog_task
+    if _watchdog_task is not None:
+        return
+    if not faulthandler.is_enabled():
+        # Also installs handlers for fatal signals (SIGSEGV etc.) — harmless
+        # and useful; the dump_traceback_later timer works either way.
+        faulthandler.enable()
+    _watchdog_task = asyncio.create_task(_heartbeat_loop())
+    logger.info(
+        "Event-loop stall watchdog started — dumps all thread stacks to stderr if the loop stalls for more than %.0fs",
+        STALL_THRESHOLD,
+    )
+
+
+def stop_loop_watchdog() -> None:
+    """Stop the watchdog and disarm the pending stall timer."""
+    global _watchdog_task
+    if _watchdog_task is not None:
+        _watchdog_task.cancel()
+        _watchdog_task = None
+    try:
+        faulthandler.cancel_dump_traceback_later()
+    except Exception:
+        pass
+    logger.info("Event-loop stall watchdog stopped")

+ 72 - 0
backend/tests/unit/services/test_loop_watchdog.py

@@ -0,0 +1,72 @@
+"""Tests for the event-loop stall watchdog (#1486)."""
+
+import asyncio
+import faulthandler
+from unittest.mock import patch
+
+import pytest
+
+from backend.app.services import loop_watchdog
+
+
+@pytest.fixture(autouse=True)
+def _mock_faulthandler():
+    """Patch faulthandler so tests never arm a real 30s stall timer that
+    could fire mid-suite. Yields (arm_mock, cancel_mock)."""
+    with (
+        patch.object(faulthandler, "dump_traceback_later") as arm,
+        patch.object(faulthandler, "cancel_dump_traceback_later") as cancel,
+    ):
+        yield arm, cancel
+    # Safety net: make sure no test leaves the watchdog task running.
+    loop_watchdog.stop_loop_watchdog()
+
+
+async def test_start_arms_the_stall_timer(_mock_faulthandler):
+    arm, cancel = _mock_faulthandler
+    loop_watchdog.start_loop_watchdog()
+    await asyncio.sleep(0)  # let the heartbeat run its first iteration
+
+    assert cancel.called, "previous timer must be cancelled before re-arming"
+    assert arm.called
+    # Armed STALL_THRESHOLD seconds ahead, single-shot.
+    args, kwargs = arm.call_args
+    assert args[0] == loop_watchdog.STALL_THRESHOLD
+    assert kwargs.get("repeat") is False
+
+
+async def test_start_is_idempotent(_mock_faulthandler):
+    loop_watchdog.start_loop_watchdog()
+    first = loop_watchdog._watchdog_task
+    loop_watchdog.start_loop_watchdog()
+    assert loop_watchdog._watchdog_task is first, "second start must not spawn a task"
+
+
+async def test_stop_cancels_the_task_and_disarms(_mock_faulthandler):
+    _arm, cancel = _mock_faulthandler
+    loop_watchdog.start_loop_watchdog()
+    task = loop_watchdog._watchdog_task
+    assert task is not None
+
+    cancel.reset_mock()
+    loop_watchdog.stop_loop_watchdog()
+
+    assert loop_watchdog._watchdog_task is None
+    assert cancel.called, "stop must disarm the pending faulthandler timer"
+    await asyncio.sleep(0)
+    assert task.cancelled() or task.done()
+
+
+async def test_heartbeat_interval_is_below_stall_threshold():
+    """A healthy loop must always re-arm before the timer can fire."""
+    assert loop_watchdog.HEARTBEAT_INTERVAL < loop_watchdog.STALL_THRESHOLD
+
+
+async def test_rearm_failure_does_not_crash_the_watchdog(_mock_faulthandler):
+    """A faulthandler hiccup must not take down the heartbeat task."""
+    arm, _cancel = _mock_faulthandler
+    arm.side_effect = RuntimeError("boom")
+    loop_watchdog.start_loop_watchdog()
+    await asyncio.sleep(0)
+    task = loop_watchdog._watchdog_task
+    assert task is not None and not task.done(), "watchdog must survive a re-arm error"