소스 검색

Fix MQTT connected/disconnected UI bouncing caused by stale reconnect

  The stale connection fix (6ac375a) force-closed the socket to trigger
  paho reconnect, but paho's disconnect callback redundantly broadcast
  state changes, causing rapid connected/disconnected cycling in the UI.
  Now sets a flag before the force-close so _on_disconnect skips the
  redundant broadcast. Also relaxed keepalive from 15s to 30s — the
  aggressive interval caused spurious disconnects on transient hiccups.

  Moved _disconnection_event.set() to the top of _on_disconnect so no
  early-return filter can prevent it from unblocking disconnect() callers.

  Removed leftover debug payload dump.
maziggy 2 달 전
부모
커밋
6c2df27860
3개의 변경된 파일200개의 추가작업 그리고 10개의 파일을 삭제
  1. 1 1
      CHANGELOG.md
  2. 32 9
      backend/app/services/bambu_mqtt.py
  3. 167 0
      backend/tests/unit/services/test_bambu_mqtt.py

+ 1 - 1
CHANGELOG.md

@@ -57,7 +57,7 @@ All notable changes to Bambuddy will be documented in this file.
 - **SpoolBuddy Daemon Reports Stale Version** — The SpoolBuddy daemon maintained its own hardcoded `__version__` that was never bumped to `0.2.3b1`, causing the update check to incorrectly show an update from `0.2.2b1` to the latest release. Fixed by reading the version at import time from the backend's `APP_VERSION` in `backend/app/core/config.py` — the single source of truth — so the daemon version is always in sync.
 - **SpoolBuddy Update Columns Missing from Database** — The OTA update feature added `update_status` and `update_message` to the device model but was missing the database migration, causing "no such column" errors on existing installations.
 - **Queue Print Command Not Reaching Printer** ([#778](https://github.com/maziggy/bambuddy/issues/778)) — When a queue item targeted a specific printer and the scheduler's power-on-wait loop triggered, each reconnection attempt created a new MQTT client that re-attempted subscribing to the request topic. On printers whose broker rejects this subscription (e.g. A1), this caused repeated connect/disconnect cycles for up to 170 seconds, leaving the MQTT connection in a fragile state where the print command could silently fail to reach the printer. Fixed by caching request topic support state per serial number at the class level, so new client instances skip the subscription immediately instead of rediscovering the rejection. Reported by @RubenKremer.
-- **Stale MQTT Connection Not Recovering** ([#813](https://github.com/maziggy/bambuddy/issues/813)) — When a printer's MQTT connection went stale (no messages for 60+ seconds), Bambuddy marked it as disconnected but did not force the underlying TCP socket closed, so paho-mqtt's auto-reconnect never triggered and print commands were silently published into a dead connection. Additionally, the spurious-disconnect filter (designed to ignore false disconnect callbacks from paho) used a 30-second window that could suppress real keepalive-timeout disconnects (which fire at ~22.5s with keepalive=15s). Fixed by: (1) never suppressing error disconnects (`rc.is_failure`) regardless of timing, and tightening the spurious filter window from 30s to 10s; (2) force-closing the socket on stale detection so paho's loop thread detects the break and auto-reconnects. Reported by @inkdawgz.
+- **Stale MQTT Connection Not Recovering** ([#813](https://github.com/maziggy/bambuddy/issues/813)) — When a printer's MQTT connection went stale (no messages for 60+ seconds), Bambuddy marked it as disconnected but did not force the underlying TCP socket closed, so paho-mqtt's auto-reconnect never triggered and print commands were silently published into a dead connection. Fixed by force-closing the socket on stale detection so paho's loop thread detects the break and auto-reconnects. The initial fix caused rapid connected/disconnected bouncing in the UI because the force-close triggered paho's disconnect callback which redundantly broadcast state changes; now uses a flag to suppress the redundant callback. Also relaxed MQTT keepalive from 15s to 30s — the aggressive 15s keepalive caused spurious disconnects on transient network hiccups. Error disconnects (`rc.is_failure`) are never suppressed by the spurious-disconnect filter. Additionally, the disconnect event used by `disconnect()` is now fired unconditionally at the top of the callback so that no early-return filter can prevent it from unblocking callers. Reported by @inkdawgz.
 - **P1S/P1P Printer Card Shows "Printing" When Idle** ([#813](https://github.com/maziggy/bambuddy/issues/813)) — Some P1S and P1P firmware versions report `stg_cur=0` when idle, which maps to the "Printing" stage name and overrides the correct "Idle" gcode_state on the printer card. The System Info page was unaffected because it displays the raw gcode_state. Extended the existing A1/A1 Mini workaround for this firmware bug to also cover P1S and P1P models. Reported by @inkdawgz.
 - **AMS Slot Search Shows Unrelated Profiles** ([#681](https://github.com/maziggy/bambuddy/issues/681)) — Searching for a non-existent filament profile in the AMS slot configuration showed unrelated profiles instead of an empty result. The saved preset bypassed the search filter entirely, so stale mappings (e.g. a slot previously configured with "Bambu PLA Matte" that now holds a Silk spool) would always appear regardless of the search query. The saved preset now only bypasses the printer model filter, not the search filter. Reported by @RosdasHH.
 - **Virtual Printer FTP Routed to Wrong VP** ([#735](https://github.com/maziggy/bambuddy/issues/735)) — When running multiple virtual printers with different access codes on separate bind IPs, FTP connections were routed to the wrong VP. Root cause: the iptables `REDIRECT` rule rewrites the destination IP to the incoming interface's primary address, so all FTP traffic went to the first VP regardless of the intended target. Fix: FTP server now binds directly to port 990 (standard implicit FTPS), eliminating the need for iptables redirect. Requires `CAP_NET_BIND_SERVICE` (already set in the systemd service and Docker image). Also removed a global `set_exception_handler()` in the MQTT server that caused spurious error messages when running multiple VPs. See `docs/migration-vp-ftp-port.md` for migration steps. Reported by @VREmma.

+ 32 - 9
backend/app/services/bambu_mqtt.py

@@ -344,6 +344,10 @@ class BambuMQTTClient:
         self._request_topic_sub_time: float = 0.0
         self._request_topic_confirmed: bool = False
 
+        # Set when check_staleness() force-closes the socket to trigger reconnect.
+        # Prevents _on_disconnect from redundantly broadcasting state (already done).
+        self._stale_reconnecting: bool = False
+
     @property
     def topic_subscribe(self) -> str:
         return f"device/{self.serial_number}/report"
@@ -375,6 +379,9 @@ class BambuMQTTClient:
             # the broken connection and triggers auto-reconnect.  We don't call
             # client.disconnect() because that's a clean disconnect and paho
             # would NOT auto-reconnect afterwards.
+            # Set flag so _on_disconnect knows this was intentional and skips
+            # redundant state broadcast (we already set connected=False above).
+            self._stale_reconnecting = True
             if self._client:
                 try:
                     sock = self._client.socket()
@@ -387,6 +394,7 @@ class BambuMQTTClient:
     def _on_connect(self, client, userdata, flags, rc, properties=None):
         if rc == 0:
             self.state.connected = True
+            self._stale_reconnecting = False  # Clear stale-reconnect flag on successful connect
             # Reset per-connection warning state so warnings fire once per (re)connection
             self._ams_version_warned = set()
             client.subscribe(self.topic_subscribe)
@@ -444,6 +452,23 @@ class BambuMQTTClient:
             self._request_topic_sub_time = 0.0
 
     def _on_disconnect(self, client, userdata, disconnect_flags=None, rc=None, properties=None):
+        # Always unblock disconnect() callers, regardless of whether we suppress
+        # the state broadcast below.  disconnect() sets _disconnection_event and
+        # waits on it — every callback path must fire it.
+        if self._disconnection_event:
+            self._disconnection_event.set()
+
+        # If we intentionally closed the socket for stale reconnect, don't broadcast
+        # another state change — check_staleness() already set connected=False and
+        # notified the UI.  Just log and let paho auto-reconnect.
+        if self._stale_reconnecting:
+            logger.info(
+                "[%s] Disconnect callback after stale reconnect (expected), rc=%s",
+                self.serial_number,
+                rc,
+            )
+            return
+
         # Ignore spurious disconnect callbacks if we've received a message recently
         # Paho-mqtt sometimes fires disconnect callbacks while the connection is still active.
         # BUT: never suppress error disconnects (keepalive timeout, connection lost, etc.)
@@ -478,8 +503,6 @@ class BambuMQTTClient:
         self.state.connected = False
         if self.on_state_change:
             self.on_state_change(self.state)
-        if self._disconnection_event:
-            self._disconnection_event.set()
 
     def _on_message(self, client, userdata, msg):
         try:
@@ -505,10 +528,6 @@ class BambuMQTTClient:
                 self._handle_request_message(payload)
                 return
 
-            # TEMP: Dump full payload once to find extruder state field
-            if not hasattr(self, "_payload_dumped"):
-                self._payload_dumped = True
-                logger.debug("[%s] FULL MQTT PAYLOAD DUMP:\n%s", self.serial_number, json.dumps(payload, indent=2))
             # Log message if logging is enabled
             if self._logging_enabled:
                 self._message_log.append(
@@ -2628,9 +2647,13 @@ class BambuMQTTClient:
         ssl_context.verify_mode = ssl.CERT_NONE
         self._client.tls_set_context(ssl_context)
 
-        # Use shorter keepalive (15s) for faster disconnect detection
-        # Paho considers connection lost after 1.5x keepalive with no response
-        self._client.connect_async(self.ip_address, self.MQTT_PORT, keepalive=15)
+        # Keepalive: paho sends PINGREQs at this interval, broker considers
+        # client dead at 1.5x.  30s is a good balance — fast enough to detect
+        # real network loss (45s), not so aggressive that transient hiccups
+        # trigger false disconnects.  Stale detection (60s no messages) handles
+        # the P1S/P1P firmware bug where the broker stops publishing but the
+        # TCP connection stays alive.
+        self._client.connect_async(self.ip_address, self.MQTT_PORT, keepalive=30)
         self._client.loop_start()
 
     def start_print(

+ 167 - 0
backend/tests/unit/services/test_bambu_mqtt.py

@@ -2974,3 +2974,170 @@ class TestStartPrintAmsMapping:
         cmd = self._get_published_command(mqtt_client)
         assert "ams_mapping" not in cmd
         assert "ams_mapping2" not in cmd
+
+
+class TestStaleReconnect:
+    """Tests for stale connection detection and reconnect without UI bouncing."""
+
+    @pytest.fixture
+    def mqtt_client(self):
+        from backend.app.services.bambu_mqtt import BambuMQTTClient
+
+        client = BambuMQTTClient(
+            ip_address="192.168.1.100",
+            serial_number="TEST_STALE",
+            access_code="12345678",
+        )
+        return client
+
+    def test_check_staleness_sets_flag_and_broadcasts_once(self, mqtt_client):
+        """check_staleness() should set connected=False, broadcast, and set _stale_reconnecting."""
+        import time
+
+        state_changes = []
+        mqtt_client.on_state_change = lambda s: state_changes.append(s.connected)
+        mqtt_client.state.connected = True
+        mqtt_client._last_message_time = time.time() - 120  # well past 60s threshold
+
+        result = mqtt_client.check_staleness()
+
+        assert result is False
+        assert mqtt_client.state.connected is False
+        assert mqtt_client._stale_reconnecting is True
+        assert state_changes == [False]  # Exactly one broadcast
+
+    def test_check_staleness_noop_when_not_connected(self, mqtt_client):
+        """check_staleness() should not set flag when already disconnected."""
+        import time
+
+        mqtt_client.state.connected = False
+        mqtt_client._last_message_time = time.time() - 120
+
+        mqtt_client.check_staleness()
+
+        assert mqtt_client._stale_reconnecting is False
+
+    def test_check_staleness_noop_when_not_stale(self, mqtt_client):
+        """check_staleness() should not set flag when messages are recent."""
+        import time
+
+        mqtt_client.state.connected = True
+        mqtt_client._last_message_time = time.time() - 5  # 5s ago, well within 60s
+
+        result = mqtt_client.check_staleness()
+
+        assert result is True
+        assert mqtt_client.state.connected is True
+        assert mqtt_client._stale_reconnecting is False
+
+    def test_on_disconnect_skipped_during_stale_reconnect(self, mqtt_client):
+        """_on_disconnect should not broadcast state when _stale_reconnecting is set."""
+        state_changes = []
+        mqtt_client.on_state_change = lambda s: state_changes.append(s.connected)
+        mqtt_client._stale_reconnecting = True
+        mqtt_client.state.connected = False
+
+        mqtt_client._on_disconnect(None, None)
+
+        # No state change broadcast — check_staleness() already did it
+        assert state_changes == []
+        assert mqtt_client.state.connected is False
+
+    def test_on_disconnect_fires_event_during_stale_reconnect(self, mqtt_client):
+        """_on_disconnect must still fire _disconnection_event even during stale reconnect.
+
+        If disconnect() is called while _stale_reconnecting is True (e.g. user removes
+        the printer before paho reconnects), the event must fire so disconnect() doesn't hang.
+        """
+        import threading
+
+        mqtt_client._stale_reconnecting = True
+        mqtt_client._disconnection_event = threading.Event()
+
+        mqtt_client._on_disconnect(None, None)
+
+        assert mqtt_client._disconnection_event.is_set()
+
+    def test_on_connect_clears_stale_reconnecting_flag(self, mqtt_client):
+        """_on_connect should clear _stale_reconnecting and restore connected=True."""
+        mqtt_client._stale_reconnecting = True
+        mqtt_client.state.connected = False
+
+        subscribe_calls = []
+        mock_client = type(
+            "MockClient",
+            (),
+            {
+                "subscribe": lambda self, topic: subscribe_calls.append(topic) or (0, 1),
+            },
+        )()
+
+        mqtt_client._on_connect(mock_client, None, None, 0)
+
+        assert mqtt_client._stale_reconnecting is False
+        assert mqtt_client.state.connected is True
+
+    def test_full_stale_reconnect_cycle_no_bounce(self, mqtt_client):
+        """Full cycle: stale → disconnect callback → reconnect. UI should see exactly one disconnect."""
+        import time
+
+        state_changes = []
+        mqtt_client.on_state_change = lambda s: state_changes.append(s.connected)
+        mqtt_client.state.connected = True
+        mqtt_client._last_message_time = time.time() - 120
+
+        # Step 1: Stale detection triggers
+        mqtt_client.check_staleness()
+        assert state_changes == [False]
+
+        # Step 2: Paho fires disconnect callback (from socket close)
+        mqtt_client._on_disconnect(None, None)
+        # Should NOT add another state change
+        assert state_changes == [False]
+
+        # Step 3: Paho reconnects
+        subscribe_calls = []
+        mock_client = type(
+            "MockClient",
+            (),
+            {
+                "subscribe": lambda self, topic: subscribe_calls.append(topic) or (0, 1),
+            },
+        )()
+        mqtt_client._on_connect(mock_client, None, None, 0)
+        assert state_changes == [False, True]  # Now connected again
+        assert mqtt_client._stale_reconnecting is False
+
+    def test_spurious_disconnect_suppressed_when_recent_messages(self, mqtt_client):
+        """Non-error disconnect with recent messages should be suppressed."""
+        import time
+
+        state_changes = []
+        mqtt_client.on_state_change = lambda s: state_changes.append(s.connected)
+        mqtt_client.state.connected = True
+        mqtt_client._last_message_time = time.time() - 3  # 3s ago
+
+        # Non-error disconnect (rc=None)
+        mqtt_client._on_disconnect(None, None)
+
+        assert state_changes == []
+        assert mqtt_client.state.connected is True
+
+    def test_error_disconnect_not_suppressed_despite_recent_messages(self, mqtt_client):
+        """Error disconnect should always be processed, even with recent messages."""
+        import time
+
+        import paho.mqtt.client as mqtt
+        from paho.mqtt.reasoncodes import ReasonCode
+
+        state_changes = []
+        mqtt_client.on_state_change = lambda s: state_changes.append(s.connected)
+        mqtt_client.state.connected = True
+        mqtt_client._last_message_time = time.time() - 3  # 3s ago
+
+        # Error disconnect (rc.is_failure = True)
+        rc = ReasonCode(mqtt.CONNACK >> 4, identifier=0x80)  # Failure code
+        mqtt_client._on_disconnect(None, None, rc=rc)
+
+        assert state_changes == [False]
+        assert mqtt_client.state.connected is False