Browse Source

Fix: Add Printer no longer hangs the container on P1S (#1445)

  The pre-insert MQTT probe added in 0.2.4.2 (b51598ea) had two bugs that
  compounded on P1S firmware specifically:

  1. Fixed 2-second sleep was too short. P1S broker + TLS handshake
  routinely needs 3-5s to surface CONNACK on a cold MQTT session (same
  firmware family with the documented "broker stops publishing but TCP
  stays alive" quirk at bambu_mqtt.py:3181), so the probe falsely rejected
  a printer that would have connected fine. H2C's broker is snappier and
  cleared the 2s window without trouble — which is why the reporter's
  H2C added without issue and only the P1S misbehaved.

  2. client.disconnect() ran synchronously on the asyncio thread.
  BambuMQTTClient.disconnect() ends in paho's loop_stop() which joins
  the network thread; if that thread was still mid-TLS-handshake to the
  slow P1S socket when teardown ran, the join blocked the asyncio thread
  for as long as the handshake took to complete or fail. POST /printers
  wedged, every other HTTP request queued behind it, Docker healthcheck
  timed out — user-visible symptom: "the container hangs."

  Fix:
  - Replace the fixed sleep with a polling loop (8s budget, 200ms tick,
    early-returns the moment state.connected flips True). Slow brokers
    get the headroom they need; happy-path connects still finish in
    ~1-2s. Constants exposed as PROBE_TIMEOUT_SECONDS / PROBE_POLL_
    INTERVAL_SECONDS class attributes so tests can dial them down.
  - Move client.disconnect() to await asyncio.to_thread(...) so paho's
    thread-join can never block the event loop.

  The empty-card-report-prevention goal of the original probe stays
  intact: a genuinely wrong access code still results in connected=False
  after the 8s budget, the 400 with code=printer_connection_failed
  still fires, the row is still never persisted.
maziggy 1 week ago
parent
commit
0406487eb3

File diff suppressed because it is too large
+ 0 - 0
CHANGELOG.md


+ 25 - 3
backend/app/services/printer_manager.py

@@ -616,13 +616,31 @@ class PrinterManager:
             return self._clients[printer_id].request_status_update()
         return False
 
+    # Probe budget for test_connection (#1445). Was a fixed 2s sleep, which was
+    # too short for P1S firmware whose broker / TLS handshake routinely takes
+    # 3–5s to surface a CONNACK on a cold MQTT session. We now poll up to
+    # PROBE_TIMEOUT_SECONDS and early-return the moment we see connected=True,
+    # so happy-path connections still finish in ~1–2s and slow brokers get the
+    # headroom they need instead of getting falsely rejected.
+    PROBE_TIMEOUT_SECONDS = 8.0
+    PROBE_POLL_INTERVAL_SECONDS = 0.2
+
     async def test_connection(
         self,
         ip_address: str,
         serial_number: str,
         access_code: str,
     ) -> dict:
-        """Test connection to a printer without persisting."""
+        """Test connection to a printer without persisting.
+
+        Polls for up to PROBE_TIMEOUT_SECONDS and tears the probe client down
+        off-loop. The teardown matters: `client.disconnect()` ends in paho's
+        `loop_stop()` which `join()`s the network thread — if the thread is
+        still mid-TLS-handshake to a slow printer, that join blocks the
+        asyncio event loop and every other HTTP request queues behind it. The
+        original synchronous teardown produced the #1445 "Docker container
+        hangs" symptom on P1S when called from POST /printers/.
+        """
         client = BambuMQTTClient(
             ip_address=ip_address,
             serial_number=serial_number,
@@ -631,7 +649,9 @@ class PrinterManager:
 
         try:
             client.connect()
-            await asyncio.sleep(2)
+            deadline = asyncio.get_running_loop().time() + self.PROBE_TIMEOUT_SECONDS
+            while not client.state.connected and asyncio.get_running_loop().time() < deadline:
+                await asyncio.sleep(self.PROBE_POLL_INTERVAL_SECONDS)
 
             result = {
                 "success": client.state.connected,
@@ -639,7 +659,9 @@ class PrinterManager:
                 "model": client.state.raw_data.get("device_model"),
             }
         finally:
-            client.disconnect()
+            # Off-loop teardown — see docstring. paho's loop_stop() joins the
+            # network thread which may still be in a slow TLS handshake.
+            await asyncio.to_thread(client.disconnect)
 
         return result
 

+ 115 - 1
backend/tests/unit/services/test_printer_manager.py

@@ -584,12 +584,126 @@ class TestPrinterManager:
             mock_instance.state.connected = False
             MockClient.return_value = mock_instance
 
-            result = await manager.test_connection("192.168.1.100", "00M09A123456789", "12345678")
+            # Shorten the probe budget so the test doesn't burn the full
+            # 8-second production timeout while polling a failing connection.
+            with (
+                patch.object(manager, "PROBE_TIMEOUT_SECONDS", 0.4),
+                patch.object(manager, "PROBE_POLL_INTERVAL_SECONDS", 0.1),
+            ):
+                result = await manager.test_connection("192.168.1.100", "00M09A123456789", "12345678")
 
             assert result["success"] is False
             assert result["state"] is None
             mock_instance.disconnect.assert_called_once()
 
+    @pytest.mark.asyncio
+    async def test_test_connection_polls_and_returns_early_on_connect(self, manager):
+        """#1445: a slow printer that finishes its handshake mid-probe must
+        not be reported as a failure. Previously a fixed 2s sleep meant P1S
+        TLS / CONNACK that took 3-5s got falsely rejected; now we poll and
+        early-return as soon as connected flips True.
+        """
+        import asyncio
+        import time
+
+        with patch("backend.app.services.printer_manager.BambuMQTTClient") as MockClient:
+            mock_instance = MagicMock()
+            mock_instance.state = MagicMock()
+            mock_instance.state.connected = False  # not connected at probe start
+            mock_instance.state.state = "IDLE"
+            mock_instance.state.raw_data = {"device_model": "P1S"}
+            MockClient.return_value = mock_instance
+
+            async def flip_connected_after(delay: float):
+                await asyncio.sleep(delay)
+                mock_instance.state.connected = True
+
+            # Simulates the P1S broker finishing its slow handshake ~0.5s in,
+            # well past the old 2s-or-fail boundary's natural variance.
+            with (
+                patch.object(manager, "PROBE_TIMEOUT_SECONDS", 3.0),
+                patch.object(manager, "PROBE_POLL_INTERVAL_SECONDS", 0.05),
+            ):
+                start = time.monotonic()
+                flip_task = asyncio.create_task(flip_connected_after(0.5))
+                try:
+                    result = await manager.test_connection("192.168.1.100", "00M09A123456789", "12345678")
+                finally:
+                    await flip_task
+                elapsed = time.monotonic() - start
+
+            assert result["success"] is True
+            assert result["state"] == "IDLE"
+            # Early-return guarantee: must come back well before the configured
+            # timeout once connected flips. ~0.5s + one poll interval is plenty.
+            assert elapsed < 1.5, f"probe should have early-returned shortly after 0.5s, took {elapsed:.2f}s"
+
+    @pytest.mark.asyncio
+    async def test_test_connection_disconnect_runs_off_loop(self, manager):
+        """#1445: the root cause of the "Docker container hangs" symptom was
+        `client.disconnect()` running on the asyncio thread — paho's
+        `loop_stop()` does a thread-join that blocks until its network
+        thread exits, which on a slow P1S TLS handshake could take many
+        seconds. This test pins the off-loop teardown so a regression that
+        reintroduces sync disconnect breaks CI immediately.
+        """
+        import asyncio
+        import threading
+        import time
+
+        with patch("backend.app.services.printer_manager.BambuMQTTClient") as MockClient:
+            asyncio_thread_id = threading.get_ident()
+            disconnect_thread_ids: list[int] = []
+            disconnect_blocked_for: list[float] = []
+
+            def slow_blocking_disconnect():
+                # Mirrors paho.Client.loop_stop()'s thread-join semantics —
+                # if this runs on the asyncio thread the event loop stalls.
+                disconnect_thread_ids.append(threading.get_ident())
+                start = time.monotonic()
+                time.sleep(0.4)
+                disconnect_blocked_for.append(time.monotonic() - start)
+
+            mock_instance = MagicMock()
+            mock_instance.state = MagicMock()
+            mock_instance.state.connected = True
+            mock_instance.state.state = "IDLE"
+            mock_instance.state.raw_data = {"device_model": "P1S"}
+            mock_instance.disconnect = slow_blocking_disconnect
+            MockClient.return_value = mock_instance
+
+            # Another coroutine must keep making progress while disconnect()
+            # runs — proves the event loop was not blocked.
+            event_loop_alive_ticks = 0
+
+            async def heartbeat():
+                nonlocal event_loop_alive_ticks
+                while True:
+                    await asyncio.sleep(0.05)
+                    event_loop_alive_ticks += 1
+
+            heartbeat_task = asyncio.create_task(heartbeat())
+            try:
+                await manager.test_connection("192.168.1.100", "00M09A123456789", "12345678")
+            finally:
+                heartbeat_task.cancel()
+                try:
+                    await heartbeat_task
+                except asyncio.CancelledError:
+                    pass
+
+            # disconnect ran on a different thread than asyncio's
+            assert disconnect_thread_ids, "disconnect was never called"
+            assert disconnect_thread_ids[0] != asyncio_thread_id, (
+                "disconnect ran on the asyncio thread — this blocks the event loop (#1445)"
+            )
+            # Heartbeat made progress while the 0.4s disconnect was blocking
+            # the worker thread (proves the loop wasn't stalled).
+            assert event_loop_alive_ticks >= 3, (
+                f"event loop appears to have stalled during disconnect "
+                f"(only {event_loop_alive_ticks} heartbeats; expected >=3)"
+            )
+
     # ========================================================================
     # Tests for current print user tracking (Issue #206)
     # ========================================================================

Some files were not shown because too many files changed in this diff