Просмотр исходного кода

fix(obico): exclude snapshot capture PIDs from stream cleanup (#172)

  The periodic camera cleanup task scans /proc for ffmpeg processes and
  kills any not in the active-streams registry. The Obico detection
  service's capture_camera_frame_bytes() spawns short-lived ffmpeg for
  snapshots but never registered the PID — so cleanup killed it as
  "orphaned" mid-capture (SIGKILL, exit -9), producing false errors and
  missed detection frames.

  Track capture PIDs in _active_capture_pids and exclude them from the
  cleanup kill list.
maziggy 1 месяц назад
Родитель
Сommit
ef37ffa7c7

+ 1 - 0
CHANGELOG.md

@@ -31,6 +31,7 @@ All notable changes to Bambuddy will be documented in this file.
 - **Camera Popup Shows "Valid camera stream token required" With Auth Enabled** ([#979](https://github.com/maziggy/bambuddy/issues/979)) — When Camera View Mode was set to "Window" and authentication was enabled, clicking the camera button opened a popup that immediately failed with `"Valid camera stream token required"`, while the embedded overlay kept working. Two root causes: (1) `window.open(...)` passed `noopener` in the popup features, which severed the opener link and prevented the browser from copying sessionStorage (where the auth token lives) into the popup — so the new window booted unauthenticated and the `POST /printers/camera/stream-token` fetch returned 401, leaving the `<img>` src without the required `?token=` query param; (2) even once the token arrived, `CameraPage` computed its URL from the module-level stream-token cache on render and never re-rendered when the cache was updated in a `useEffect`, so the first paint locked in a tokenless URL that the backend kept rejecting. Fixed by dropping `noopener` from the camera popup features (same-origin, trusted window) so sessionStorage is inherited, subscribing `CameraPage` to the `camera-stream-token` React Query so it re-renders the moment the token resolves, and appending the token directly from the reactive query value instead of the effect-synced module cache — the `<img>` src stays empty until the token is ready, so no tokenless request ever leaves the popup. Embedded-overlay mode was unaffected. Thanks to @VREmma for the reproducer.
 - **Obico ML API Got 401 When Fetching Snapshot with Auth Enabled** ([#172](https://github.com/maziggy/bambuddy/issues/172)) — The Obico failure-detection service handed the ML API container a snapshot URL (`/api/v1/printers/{id}/camera/snapshot`) for it to `GET` directly, but when Bambuddy authentication was enabled the endpoint returned 401 and the ML API surfaced "Failed to get image" (visible as a 400 from the ML API back to Bambuddy). The detection service now appends a short-lived camera-stream token to the snapshot URL — the same token scheme already used by `<img>`-based camera consumers, which the snapshot endpoint already accepts. The token is cached on the service and refreshed before its 60-minute expiry, so no extra per-call DB churn. When auth is disabled the token is simply ignored. Thanks to @fblix for reporting.
 - **Obico Detection Timed Out on Slow Snapshot** ([#172](https://github.com/maziggy/bambuddy/issues/172)) — Second wave of #172 — once the auth/401 fix landed, `fblix` reported a new failure mode: `Read timed out. (read timeout=5)`. Bambuddy's `/camera/snapshot` endpoint takes 5–10 s on cold calls (it spins up a TLS proxy + ffmpeg + waits for the next RTSP keyframe), but Obico's ML API `server.py` has a hardcoded `timeout = (0.1, 5)` on the URL it fetches — any snapshot that doesn't return within 5 s gets the entire detection cycle reported as "Failed to get image". Raising the timeout in the user's container was only a workaround; every Obico container would have the same ceiling. Fixed by flipping the flow around: the detection loop now captures the JPEG *locally* (with a long 20 s timeout that we control), stashes the bytes under a random 32-byte single-use nonce, and hands Obico's ML API a new `/api/v1/obico/cached-frame/{nonce}` URL that returns the pre-captured bytes instantly. Obico's 5 s timeout no longer races the capture pipeline — its fetch is a pure in-memory lookup. The nonce is the credential (URL-safe, 256 bits of entropy), single-use (popped on read), and expires in 30 s, so the endpoint can be unauthenticated without widening the camera access surface. External cameras (MJPEG/RTSP/HTTP snapshot) are captured via the same `capture_frame` helper used by the snapshot endpoint. Thanks to @fblix for the detailed reproducer with the exact timeout numbers.
+- **Obico Detection Snapshot Killed by Stream Cleanup** ([#172](https://github.com/maziggy/bambuddy/issues/172)) — Third wave of #172 — once the cached-frame fix landed, `fblix` reported a permanent "Failed to capture snapshot" warning in the UI. The periodic camera stream cleanup task scans `/proc` for ffmpeg processes with Bambu RTSP URLs and kills any that aren't in the active-streams registry. The Obico detection service's `capture_camera_frame_bytes()` spawns its own short-lived ffmpeg process to grab a single JPEG frame, but that process was never registered with the stream cleanup — so when the 60-second cleanup cycle happened to run during the 5–10 s capture window, it killed the ffmpeg as "orphaned" (exit code -9). The detection service recovered on the next poll, but the kill produced unnecessary error logs and a missed detection frame. Fixed by tracking capture PIDs in a module-level set (`_active_capture_pids`) and excluding them from the `/proc`-scan kill list. Thanks to @fblix for the detailed timing analysis.
 - **Direct Print from Library Not Attributed to User** — Clicking the Print button on a library file dispatched the job with no `created_by_id`, so the resulting archive had no owner and the print didn't show up in per-user statistics. The Queue and Reprint paths already forwarded the authenticated user; the library `POST /files/{file_id}/print` endpoint now does the same, reading the user from the JWT and passing it through to the dispatcher so direct prints are attributed like queued and reprinted ones.
 - **Add/Edit Printer Modal Clipped on Short Viewports** ([#964](https://github.com/maziggy/bambuddy/issues/964)) — On short or zoomed-in browser windows, the Add Printer and Edit Printer dialogs exceeded the viewport height with no scroll, hiding the lower fields (Access Code, Model, Location) and the Save button. Users had to zoom the browser out to complete the form. The modal overlay now scrolls and the card caps at `calc(100vh - 2rem)` with internal overflow so every field stays reachable regardless of viewport height. Thanks to @MartinNYHC for reporting.
 - **AMS Drying Silently Does Nothing** ([#971](https://github.com/maziggy/bambuddy/issues/971)) — Clicking Start Drying on a supported printer (e.g. P1S with AMS 2 Pro) could publish the MQTT command successfully but leave the AMS idle with no UI feedback. Two issues: (1) the firmware rejects the command when `dry_sf_reason` reports a blocking state (most commonly code 8 — AMS 2 Pro external power adapter not plugged in — but also "AMS busy", "already drying", etc.), and Bambuddy parsed that array but never surfaced it to the user; (2) the payload sent `filament: ""`, which some firmwares treat as an invalid-field refusal. The `/drying/start` endpoint now inspects the live `dry_sf_reason` for the target AMS unit and returns a descriptive 409 (e.g. "Plug in the external AMS power adapter to start drying") instead of silently publishing, and backfills an empty `filament` from the first loaded tray's type (defaulting to `PLA`) so the printer never rejects the command for a missing field. Thanks to @MartinNYHC for reporting.

+ 5 - 0
backend/app/api/routes/camera.py

@@ -1376,6 +1376,11 @@ async def cleanup_orphaned_streams():
     # Collect PIDs that are legitimately in-use (active stream, process alive)
     active_pids = {proc.pid for proc in _active_streams.values() if proc.returncode is None}
 
+    # Also exclude PIDs from one-shot snapshot captures (Obico detection, finish photos, etc.)
+    from backend.app.services.camera import _active_capture_pids
+
+    active_pids |= _active_capture_pids
+
     # 1. /proc scan — catch ALL orphaned Bambu ffmpeg processes on the system.
     #    Any ffmpeg with rtsp(s)://bblp: that is NOT in an active stream is orphaned.
     for pid in _scan_bambu_ffmpeg_pids():

+ 8 - 0
backend/app/services/camera.py

@@ -24,6 +24,10 @@ JPEG_END = b"\xff\xd9"
 # Cache the ffmpeg path after first lookup
 _ffmpeg_path: str | None = None
 
+# Track PIDs of ffmpeg processes spawned for one-shot frame capture (snapshot).
+# The cleanup task in routes/camera.py checks this set to avoid killing active captures.
+_active_capture_pids: set[int] = set()
+
 
 def get_ffmpeg_path() -> str | None:
     """Find the ffmpeg executable path.
@@ -506,6 +510,7 @@ async def capture_camera_frame_bytes(
 
     logger.info("Capturing camera frame bytes from %s using RTSP (model: %s)", ip_address, model)
 
+    process = None
     try:
         process = await asyncio.create_subprocess_exec(
             *cmd,
@@ -513,6 +518,7 @@ async def capture_camera_frame_bytes(
             stderr=asyncio.subprocess.PIPE,
         )
 
+        _active_capture_pids.add(process.pid)
         try:
             stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
         except TimeoutError:
@@ -536,6 +542,8 @@ async def capture_camera_frame_bytes(
         logger.exception("Camera frame bytes capture failed: %s", e)
         return None
     finally:
+        if process is not None:
+            _active_capture_pids.discard(process.pid)
         proxy_server.close()
         await proxy_server.wait_closed()
 

+ 204 - 0
backend/tests/unit/test_capture_pid_tracking.py

@@ -0,0 +1,204 @@
+"""Tests for capture PID tracking and cleanup exclusion (#172).
+
+The Obico detection service spawns short-lived ffmpeg processes for snapshot
+capture via capture_camera_frame_bytes(). These must be registered in
+_active_capture_pids so the cleanup task in routes/camera.py does not kill
+them as orphaned.
+"""
+
+import asyncio
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from backend.app.services.camera import (
+    _active_capture_pids,
+    capture_camera_frame_bytes,
+)
+
+
+@pytest.fixture(autouse=True)
+def _clear_capture_pids():
+    """Ensure _active_capture_pids is empty before/after each test."""
+    _active_capture_pids.clear()
+    yield
+    _active_capture_pids.clear()
+
+
+class TestCapturePidRegistration:
+    """Verify PIDs are added/removed from _active_capture_pids."""
+
+    @pytest.mark.asyncio
+    async def test_pid_registered_during_capture(self):
+        """PID is in _active_capture_pids while ffmpeg is running."""
+        observed_pids_during_run: set[int] = set()
+
+        fake_process = MagicMock()
+        fake_process.pid = 99999
+        fake_process.returncode = 0
+
+        async def fake_communicate():
+            # Snapshot what's in the set while "ffmpeg is running"
+            observed_pids_during_run.update(_active_capture_pids)
+            return (b"\xff\xd8" + b"\x00" * 200 + b"\xff\xd9", b"")
+
+        fake_process.communicate = fake_communicate
+
+        fake_proxy_server = AsyncMock()
+        fake_proxy_server.close = MagicMock()
+
+        with (
+            patch("backend.app.services.camera.is_chamber_image_model", return_value=False),
+            patch("backend.app.services.camera.get_camera_port", return_value=322),
+            patch("backend.app.services.camera.create_tls_proxy", return_value=(12345, fake_proxy_server)),
+            patch("backend.app.services.camera.get_ffmpeg_path", return_value="/usr/bin/ffmpeg"),
+            patch("asyncio.create_subprocess_exec", return_value=fake_process),
+        ):
+            result = await capture_camera_frame_bytes("192.168.1.1", "test", "P2S", timeout=10)
+
+        # PID was registered during capture
+        assert 99999 in observed_pids_during_run
+        # PID is removed after capture completes
+        assert 99999 not in _active_capture_pids
+        # Capture returned data
+        assert result is not None
+
+    @pytest.mark.asyncio
+    async def test_pid_removed_after_failure(self):
+        """PID is cleaned up even when ffmpeg returns non-zero."""
+        fake_process = MagicMock()
+        fake_process.pid = 88888
+        fake_process.returncode = 1
+
+        async def fake_communicate():
+            return (b"", b"some error")
+
+        fake_process.communicate = fake_communicate
+
+        fake_proxy_server = AsyncMock()
+        fake_proxy_server.close = MagicMock()
+
+        with (
+            patch("backend.app.services.camera.is_chamber_image_model", return_value=False),
+            patch("backend.app.services.camera.get_camera_port", return_value=322),
+            patch("backend.app.services.camera.create_tls_proxy", return_value=(12345, fake_proxy_server)),
+            patch("backend.app.services.camera.get_ffmpeg_path", return_value="/usr/bin/ffmpeg"),
+            patch("asyncio.create_subprocess_exec", return_value=fake_process),
+        ):
+            result = await capture_camera_frame_bytes("192.168.1.1", "test", "P2S", timeout=10)
+
+        assert result is None
+        assert 88888 not in _active_capture_pids
+
+    @pytest.mark.asyncio
+    async def test_pid_removed_after_timeout(self):
+        """PID is cleaned up when ffmpeg times out."""
+        fake_process = MagicMock()
+        fake_process.pid = 77777
+        fake_process.returncode = None
+        fake_process.kill = MagicMock()
+
+        async def fake_communicate():
+            await asyncio.sleep(60)  # Will be cancelled by wait_for
+            return (b"", b"")
+
+        fake_process.communicate = fake_communicate
+
+        async def fake_wait():
+            fake_process.returncode = -9
+
+        fake_process.wait = fake_wait
+
+        fake_proxy_server = AsyncMock()
+        fake_proxy_server.close = MagicMock()
+
+        with (
+            patch("backend.app.services.camera.is_chamber_image_model", return_value=False),
+            patch("backend.app.services.camera.get_camera_port", return_value=322),
+            patch("backend.app.services.camera.create_tls_proxy", return_value=(12345, fake_proxy_server)),
+            patch("backend.app.services.camera.get_ffmpeg_path", return_value="/usr/bin/ffmpeg"),
+            patch("asyncio.create_subprocess_exec", return_value=fake_process),
+        ):
+            result = await capture_camera_frame_bytes("192.168.1.1", "test", "P2S", timeout=0.01)
+
+        assert result is None
+        assert 77777 not in _active_capture_pids
+
+    @pytest.mark.asyncio
+    async def test_no_pid_tracked_for_chamber_image_models(self):
+        """Chamber image models (A1/P1) don't spawn ffmpeg — no PID tracking."""
+        with (
+            patch("backend.app.services.camera.is_chamber_image_model", return_value=True),
+            patch("backend.app.services.camera.read_chamber_image_frame", return_value=b"\xff\xd8test\xff\xd9"),
+        ):
+            result = await capture_camera_frame_bytes("192.168.1.1", "test", "A1", timeout=10)
+
+        assert result is not None
+        assert len(_active_capture_pids) == 0
+
+    @pytest.mark.asyncio
+    async def test_no_pid_tracked_when_subprocess_fails(self):
+        """If create_subprocess_exec raises, process is None — no PID to track."""
+        fake_proxy_server = AsyncMock()
+        fake_proxy_server.close = MagicMock()
+
+        with (
+            patch("backend.app.services.camera.is_chamber_image_model", return_value=False),
+            patch("backend.app.services.camera.get_camera_port", return_value=322),
+            patch("backend.app.services.camera.create_tls_proxy", return_value=(12345, fake_proxy_server)),
+            patch("backend.app.services.camera.get_ffmpeg_path", return_value="/usr/bin/ffmpeg"),
+            patch("asyncio.create_subprocess_exec", side_effect=FileNotFoundError("ffmpeg")),
+        ):
+            result = await capture_camera_frame_bytes("192.168.1.1", "test", "P2S", timeout=10)
+
+        assert result is None
+        assert len(_active_capture_pids) == 0
+
+
+class TestCleanupExcludesCapturePids:
+    """Verify cleanup_orphaned_streams skips PIDs in _active_capture_pids."""
+
+    @pytest.mark.asyncio
+    async def test_cleanup_skips_capture_pids(self):
+        """A PID in _active_capture_pids must not be killed by cleanup."""
+        from backend.app.api.routes.camera import cleanup_orphaned_streams
+
+        _active_capture_pids.add(42000)
+
+        with (
+            patch("backend.app.api.routes.camera._scan_bambu_ffmpeg_pids", return_value=[42000]),
+            patch("backend.app.api.routes.camera._active_streams", {}),
+            patch("backend.app.api.routes.camera._spawned_ffmpeg_pids", {}),
+            patch("os.kill") as mock_kill,
+        ):
+            await cleanup_orphaned_streams()
+
+        # os.kill should NOT have been called with SIGKILL for our capture PID
+        for call in mock_kill.call_args_list:
+            pid, sig = call[0]
+            assert pid != 42000, "cleanup killed an active capture PID"
+
+    @pytest.mark.asyncio
+    async def test_cleanup_kills_non_capture_pids(self):
+        """PIDs NOT in _active_capture_pids should still be killed."""
+        import signal
+
+        from backend.app.api.routes.camera import cleanup_orphaned_streams
+
+        # 42000 is a capture PID, 43000 is truly orphaned
+        _active_capture_pids.add(42000)
+
+        with (
+            patch("backend.app.api.routes.camera._scan_bambu_ffmpeg_pids", return_value=[42000, 43000]),
+            patch("backend.app.api.routes.camera._active_streams", {}),
+            patch("backend.app.api.routes.camera._spawned_ffmpeg_pids", {}),
+            patch("os.kill") as mock_kill,
+        ):
+            await cleanup_orphaned_streams()
+
+        # 43000 should have been killed
+        mock_kill.assert_any_call(43000, signal.SIGKILL)
+
+        # 42000 should NOT have been killed with SIGKILL
+        killed_pids = [call[0][0] for call in mock_kill.call_args_list if call[0][1] == signal.SIGKILL]
+        assert 42000 not in killed_pids