Browse Source

Fix AMS slot changes failing until manual reconnect (#887)

  Add timeout and retry to the developer mode probe. After a keep-alive
  timeout, paho auto-reconnects but the session can be half-broken: the
  printer sends status but ignores commands. The probe had no recovery —
  one unanswered probe permanently blocked retries. Now times out after
  10s with one retry; two consecutive failures force-close the socket for
  a clean reconnect.
maziggy 1 month ago
parent
commit
2d5dd1b31f
3 changed files with 184 additions and 0 deletions
  1. 1 0
      CHANGELOG.md
  2. 42 0
      backend/app/services/bambu_mqtt.py
  3. 141 0
      backend/tests/unit/services/test_bambu_mqtt.py

+ 1 - 0
CHANGELOG.md

@@ -27,6 +27,7 @@ All notable changes to Bambuddy will be documented in this file.
 - **Thumbnails Broken After Backend Restart** — Archive and library thumbnails returned 401 Unauthorized after a backend restart because stream tokens are stored in memory and lost on restart. The frontend now detects failed token-protected image loads and automatically refreshes the stream token, so thumbnails recover without a page reload.
 - **SpoolBuddy Kiosk Screen Blanks on Boot** — The touchscreen display would blank immediately after the RPi booted, requiring a touch to wake. Added `consoleblank=0` to the kernel cmdline to disable Linux console blanking during the Plymouth-to-labwc transition, and changed the `wlr-randr` anti-blank loop to fire immediately instead of sleeping 60 seconds first.
 - **Queue Widget Ignores Plate-Clear Setting** ([#752](https://github.com/maziggy/bambuddy/issues/752)) — The "Clear Plate & Start Next" button on printer cards appeared even when "Require plate-clear confirmation" was disabled in Settings → Queue. The backend correctly auto-dispatched without waiting, but the frontend widget always showed the prompt. The widget now respects the setting and shows a passive queue link instead when plate-clear confirmation is disabled.
+- **AMS Slot Changes Fail Until Reconnect** ([#887](https://github.com/maziggy/bambuddy/issues/887)) — After a keep-alive timeout, paho-mqtt auto-reconnects but the new session can be half-broken: the printer continues sending status updates but silently ignores commands. The developer mode probe detected this (no response, leaving `developer_mode` as `null`), but had no timeout or recovery — one unanswered probe permanently blocked retries. Added a 10-second probe timeout with one retry; after two consecutive unanswered probes, Bambuddy force-closes the socket to trigger a clean reconnect with a fresh session.
 - **WebSocket Crash on Printers Without `fun` Field** ([#873](https://github.com/maziggy/bambuddy/issues/873)) — Connecting to printers that don't send the MQTT `fun` field (A1, P1 series, X1Plus firmware) caused a repeating `'str' object has no attribute 'get'` crash in the WebSocket handler, showing the printer as offline with missing AMS and SD card info. The developer mode probe introduced in 0.2.3b1 published an MQTT message inside `_update_state()` between overwriting `raw_data` with the full MQTT dict (where `vt_tray` is a raw dict) and restoring the previously normalized list — the `publish()` call released the GIL, letting the event loop read the un-normalized dict and iterate over string keys instead of spool dicts. Fixed by normalizing `vt_tray` dict→list in the MQTT data before assignment, and moving preserved field restoration before the probe. Added defensive normalization in `printer_state_to_dict` as a belt-and-suspenders guard.
 
 

+ 42 - 0
backend/app/services/bambu_mqtt.py

@@ -354,6 +354,8 @@ class BambuMQTTClient:
         # "mqtt message verify failed" → dev mode OFF, success → dev mode ON.
         self._dev_mode_probed: bool = False
         self._dev_mode_probe_seq: str | None = None
+        self._dev_mode_probe_time: float = 0.0  # monotonic timestamp when probe was sent
+        self._dev_mode_probe_failures: int = 0  # consecutive unanswered probes
 
         # Set when check_staleness() force-closes the socket to trigger reconnect.
         # Prevents _on_disconnect from redundantly broadcasting state (already done).
@@ -427,6 +429,8 @@ class BambuMQTTClient:
             self.state.developer_mode = None
             self._dev_mode_probed = False
             self._dev_mode_probe_seq = None
+            self._dev_mode_probe_time = 0.0
+            self._dev_mode_probe_failures = 0
             client.subscribe(self.topic_subscribe)
             # Subscribe to request topic for ams_mapping capture (if supported by broker)
             if self._request_topic_supported:
@@ -2441,6 +2445,42 @@ class BambuMQTTClient:
             # Only probe after a full pushall response (30+ keys) to avoid firing on
             # partial incremental updates that arrive before the pushall with "fun".
             self._probe_developer_mode()
+        elif self._dev_mode_probed and self._dev_mode_probe_seq is not None:
+            # Probe was sent but no response yet — check for timeout.
+            # A half-broken MQTT session (e.g. after keep-alive timeout reconnect)
+            # may deliver status pushes but silently drop commands (#887).
+            elapsed = time.monotonic() - self._dev_mode_probe_time
+            if elapsed > 10.0:
+                self._dev_mode_probe_failures += 1
+                logger.warning(
+                    "[%s] Developer mode probe timed out after %.0fs (attempt %d)",
+                    self.serial_number,
+                    elapsed,
+                    self._dev_mode_probe_failures,
+                )
+                self._dev_mode_probe_seq = None
+                if self._dev_mode_probe_failures >= 2:
+                    # Two consecutive unanswered probes — the MQTT session is likely
+                    # half-broken (printer sends status but ignores commands).
+                    # Force-close the socket so paho reconnects with a fresh session.
+                    logger.warning(
+                        "[%s] MQTT session appears broken (commands not reaching printer), forcing reconnect",
+                        self.serial_number,
+                    )
+                    self._stale_reconnecting = True
+                    self.state.connected = False
+                    if self.on_state_change:
+                        self.on_state_change(self.state)
+                    if self._client:
+                        try:
+                            sock = self._client.socket()
+                            if sock:
+                                sock.close()
+                        except Exception:
+                            pass
+                else:
+                    # Allow retry on next full status message
+                    self._dev_mode_probed = False
 
         # Log mapping data when received (for usage tracking debugging)
         if "mapping" in data:
@@ -2630,6 +2670,7 @@ class BambuMQTTClient:
         if not self._client or not self.state.connected:
             return
         self._dev_mode_probed = True
+        self._dev_mode_probe_time = time.monotonic()
         self._sequence_id += 1
         seq = str(self._sequence_id)
         self._dev_mode_probe_seq = seq
@@ -2666,6 +2707,7 @@ class BambuMQTTClient:
         Sets developer_mode based on whether the printer accepted or rejected the command.
         """
         self._dev_mode_probe_seq = None  # One-shot: don't match future responses
+        self._dev_mode_probe_failures = 0  # Reset on any response
         result = data.get("result", "")
         reason = data.get("reason", "")
 

+ 141 - 0
backend/tests/unit/services/test_bambu_mqtt.py

@@ -2788,6 +2788,147 @@ class TestDeveloperModeDetection:
         assert mqtt_client.state.developer_mode is False
 
 
+class TestDeveloperModeProbeTimeout:
+    """Tests for developer mode probe timeout, retry, and forced reconnect (#887).
+
+    When a printer's MQTT session is half-broken (sends status but ignores
+    commands), the developer mode probe gets no response.  The timeout logic
+    retries once, then force-closes the socket on the second failure.
+    """
+
+    @pytest.fixture
+    def mqtt_client(self):
+        from unittest.mock import MagicMock
+
+        from backend.app.services.bambu_mqtt import BambuMQTTClient
+
+        client = BambuMQTTClient(
+            ip_address="192.168.1.100",
+            serial_number="TEST123",
+            access_code="12345678",
+        )
+        # Simulate connected state with a mock MQTT client
+        client.state.connected = True
+        mock_paho = MagicMock()
+        mock_paho.socket.return_value = MagicMock()
+        client._client = mock_paho
+        return client
+
+    def _make_pushall_data(self):
+        """Create a print data dict with >30 keys (triggers probe) and no 'fun' field."""
+        return {f"key_{i}": i for i in range(35)}
+
+    def test_first_timeout_allows_retry(self, mqtt_client):
+        """After first probe timeout, _dev_mode_probed resets to allow retry."""
+        import time
+
+        data = self._make_pushall_data()
+
+        # First pushall triggers the probe
+        mqtt_client._update_state(data)
+        assert mqtt_client._dev_mode_probed is True
+        assert mqtt_client._dev_mode_probe_seq is not None
+        assert mqtt_client.state.developer_mode is None
+
+        # Simulate 11 seconds passing
+        mqtt_client._dev_mode_probe_time = time.monotonic() - 11.0
+
+        # Next status message detects the timeout
+        mqtt_client._update_state(data)
+        assert mqtt_client._dev_mode_probe_failures == 1
+        assert mqtt_client._dev_mode_probe_seq is None
+        # Should allow retry on next full message
+        assert mqtt_client._dev_mode_probed is False
+        # Connection should NOT be force-closed after 1 failure
+        assert mqtt_client.state.connected is True
+
+    def test_second_timeout_forces_reconnect(self, mqtt_client):
+        """After two consecutive probe timeouts, force-close the socket."""
+        import time
+
+        data = self._make_pushall_data()
+        state_change_called = []
+        mqtt_client.on_state_change = lambda s: state_change_called.append(True)
+
+        # First probe + timeout
+        mqtt_client._update_state(data)
+        mqtt_client._dev_mode_probe_time = time.monotonic() - 11.0
+        mqtt_client._update_state(data)
+        assert mqtt_client._dev_mode_probe_failures == 1
+
+        # Second probe (retry) + timeout
+        mqtt_client._update_state(data)  # triggers new probe
+        assert mqtt_client._dev_mode_probed is True
+        mqtt_client._dev_mode_probe_time = time.monotonic() - 11.0
+        mqtt_client._update_state(data)  # detects second timeout
+
+        assert mqtt_client._dev_mode_probe_failures == 2
+        assert mqtt_client.state.connected is False
+        assert mqtt_client._stale_reconnecting is True
+        # Socket should have been closed
+        mqtt_client._client.socket().close.assert_called()
+        # on_state_change should have been called
+        assert len(state_change_called) > 0
+
+    def test_successful_probe_resets_failure_counter(self, mqtt_client):
+        """A probe response after a previous failure resets the counter."""
+        import time
+
+        data = self._make_pushall_data()
+
+        # First probe + timeout → failure=1
+        mqtt_client._update_state(data)
+        seq = mqtt_client._dev_mode_probe_seq
+        mqtt_client._dev_mode_probe_time = time.monotonic() - 11.0
+        mqtt_client._update_state(data)
+        assert mqtt_client._dev_mode_probe_failures == 1
+
+        # Retry probe
+        mqtt_client._update_state(data)
+        new_seq = mqtt_client._dev_mode_probe_seq
+        assert new_seq is not None
+        assert new_seq != seq
+
+        # Simulate successful response
+        mqtt_client._handle_dev_mode_probe_response(
+            {
+                "command": "ams_filament_setting",
+                "sequence_id": new_seq,
+                "result": "success",
+            }
+        )
+        assert mqtt_client._dev_mode_probe_failures == 0
+        assert mqtt_client.state.developer_mode is True
+        assert mqtt_client._dev_mode_probe_seq is None
+
+    def test_no_timeout_when_probe_not_sent(self, mqtt_client):
+        """The timeout branch is only entered when a probe is pending."""
+        # No probe sent — _dev_mode_probed is False, _dev_mode_probe_seq is None
+        data = {"gcode_state": "IDLE", "mc_percent": 0}  # < 30 keys
+        mqtt_client._update_state(data)
+        assert mqtt_client._dev_mode_probe_failures == 0
+
+    def test_on_connect_resets_probe_state(self, mqtt_client):
+        """_on_connect resets all probe-related state for fresh detection."""
+        import time
+
+        mqtt_client._dev_mode_probed = True
+        mqtt_client._dev_mode_probe_seq = "42"
+        mqtt_client._dev_mode_probe_time = time.monotonic()
+        mqtt_client._dev_mode_probe_failures = 2
+        mqtt_client.state.developer_mode = True
+
+        # subscribe() must return (result, mid) tuple
+        mqtt_client._client.subscribe.return_value = (0, 1)
+        mqtt_client._on_connect(mqtt_client._client, None, None, 0)
+
+        assert mqtt_client.state.developer_mode is None
+        assert mqtt_client._dev_mode_probed is False
+        assert mqtt_client._dev_mode_probe_seq is None
+        assert mqtt_client._dev_mode_probe_time == 0.0
+        assert mqtt_client._dev_mode_probe_failures == 0
+
+
 class TestVtTrayNormalization:
     """Tests for vt_tray dict→list normalization in _update_state.