fix(scheduler): revert stuck queue item when printer ignores start command

  If the printer drops or ignores the MQTT project_file command (same
  half-broken-session shape as #887/#936), the queue item was permanently
  stuck in "printing" at 100% because the scheduler optimistically flipped
  the DB row right after the publish succeeded locally. A new watchdog
  polls the printer state for up to 45s after dispatch; if there's no
  transition, it reverts the item to "pending" and force-reconnects the
  MQTT session so the scheduler can retry.
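
The recovery described above has three steps. First, remember the printer state before dispatch. Second, poll until a deadline. Third, revert only if the state never moved and the row is still `printing`. Below is a minimal, self-contained sketch of that loop. It replaces the real `printer_manager`, `async_session` and `PrintQueueItem` with hypothetical stand-ins: a `get_state` callable, a `reconnect` callable and an in-memory `QueueItem`. It illustrates the control flow and is not the scheduler's actual code.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QueueItem:
    """Hypothetical in-memory stand-in for a print_queue row."""
    id: int
    status: str = "printing"


async def watchdog_print_start(
    get_state: Callable[[], Optional[str]],  # hypothetical: None means disconnected
    item: QueueItem,
    pre_state: str,
    reconnect: Callable[[str], None],  # hypothetical: forces a fresh MQTT session
    timeout: float = 45.0,
    poll_interval: float = 3.0,
) -> bool:
    """Revert `item` to "pending" if the printer never leaves `pre_state`.

    Returns True when the item was reverted and a reconnect was requested.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        await asyncio.sleep(poll_interval)
        state = get_state()
        if state is None:
            return False  # printer disconnected: leave the row alone
        if state != pre_state:
            return False  # state moved, so the start command landed
    if item.status != "printing":
        return False  # row already moved on (completed, cancelled, ...)
    item.status = "pending"  # let the scheduler pick it up again
    reconnect(f"start command unacknowledged after {timeout:.0f}s (state still {pre_state})")
    return True


async def demo() -> tuple:
    reasons: list = []
    # Printer stays in IDLE: the command was dropped, so the item is reverted.
    stuck = QueueItem(id=1)
    reverted = await watchdog_print_start(
        lambda: "IDLE", stuck, "IDLE", reasons.append, timeout=0.05, poll_interval=0.01
    )
    # Printer reports RUNNING on the first poll: the item stays "printing".
    started = QueueItem(id=2)
    started_reverted = await watchdog_print_start(
        lambda: "RUNNING", started, "IDLE", reasons.append, timeout=0.05, poll_interval=0.01
    )
    return stuck, reverted, started, started_reverted, reasons


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The real watchdog re-reads the row through a fresh DB session before reverting it. That lets a completion or cancellation that happened during the wait win over the revert. The in-memory `item.status` check plays the same role here.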
maziggy, 1 month ago
commit 3762022444
2 changed files with 62 additions and 0 deletions
  1. CHANGELOG.md (+1 -0)
  2. backend/app/services/print_scheduler.py (+61 -0)

+ 1 - 0
CHANGELOG.md

@@ -19,6 +19,7 @@ All notable changes to Bambuddy will be documented in this file.
 - **Plate-Clear Confirmation Disabled by Default** — New installs ship with Settings → Workflow → "Require Plate-Clear Confirmation" off. Multiple new users reported queued prints appearing to not start because the prompt was waiting for acknowledgement; opt in from Workflow if you want the confirmation gate.
 
 ### Fixed
+- **Queue Item Stuck in "Printing" When Start Command is Dropped** ([#967](https://github.com/maziggy/bambuddy/issues/967)) — If the physical printer dropped or ignored the MQTT `project_file` start command (same half-broken-session shape as #887/#936), the queue item was permanently orphaned in the `printing` status at 100% because the scheduler optimistically flipped the DB row to `printing` right after the publish succeeded locally and had no watchdog to revert it. Recovery required manually editing the SQLite `print_queue` table. A new watchdog now captures the printer's pre-dispatch state and polls for up to 45 s after `start_print()` returns; if the printer never transitions, the item is reverted to `pending` so the scheduler picks it up again, and the MQTT session is force-reconnected so the retry lands without a printer reboot. Thanks to @stringham for reporting.
 - **Queued Prints Require Printer Reboot to Start** ([#936](https://github.com/maziggy/bambuddy/issues/936)) — On some printers, a queued print would be uploaded via FTP and the `project_file` MQTT command would be sent, but the printer never transitioned out of `FINISH`/`IDLE` and required a power cycle to unstick — after which it often started a previously cancelled print rather than the intended one. Root cause is a half-broken MQTT session (same shape as #887): the printer keeps publishing telemetry so Bambuddy reports it as connected, but our publishes on the command topic never reach the firmware. Existing recovery only triggered via the developer-mode probe path, which skips printers that already have a known `developer_mode` value. The print-dispatch verifier now treats an unacknowledged `project_file` (state unchanged after 15 s) as the same "commands not reaching printer" signal and forces a fresh MQTT session so the next dispatch can land without a printer reboot. The existing dev-mode probe path is refactored to share the same helper.
 - **Clear Plate Confirmation Bypassed on Power Cycle** ([#961](https://github.com/maziggy/bambuddy/issues/961)) — With Auto Off enabled and another job queued, the smart plug would cut power when a print finished and immediately re-power when the scheduler saw the queue, at which point the printer booted fresh into `IDLE` and the next job auto-dispatched without the "Clear Plate & Start Next" confirmation. Root cause: the plate-cleared gate lived only in the in-memory `PrinterManager._plate_cleared` set, and the scheduler's idle check treated `IDLE` as always-idle regardless of whether a previous finish had been acknowledged — so the gate was lost across both Bambuddy restarts and the IDLE-on-boot state transition. The gate is now an `awaiting_plate_clear` column on the `printers` table, set by `on_print_complete` when a print finishes or fails, cleared by the `/printers/{id}/clear-plate` endpoint and by the scheduler when it dispatches the next job, and rehydrated from the DB into `PrinterManager` on startup. `_is_printer_idle` now short-circuits to not-idle whenever `require_plate_clear` is on and the printer is awaiting ack, regardless of the currently reported state — so the prompt survives Auto Off cycles, Bambuddy restarts, and the printer booting back into `IDLE`. The clear-plate endpoint no longer requires the printer to currently report `FINISH`/`FAILED` (it accepts the ack whenever the awaiting flag is set), and the Printers page widget prompts based on the flag rather than the reported state. Thanks to @miaopas for reporting.
 - **Insecure Temp File Creation in Backup Export** — The manual backup download endpoint used `tempfile.mktemp()`, which is vulnerable to a symlink race condition (CWE-377). Replaced with `tempfile.mkstemp()` which atomically creates the file, eliminating the TOCTOU window.
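
The #961 entry describes a plate-clear gate with two properties. The gate persists as a flag. It also overrides whatever state the printer currently reports. A minimal sketch of that gating logic follows. `PrinterRow` is a hypothetical stand-in for a `printers` row, where Bambuddy uses a real DB column. `IDLE_STATES` is an assumed set of states the scheduler treats as free to dispatch. The function names only mirror the roles described in the changelog.

```python
from dataclasses import dataclass

# Assumption: reported states the scheduler would otherwise treat as idle.
IDLE_STATES = {"IDLE", "FINISH", "FAILED"}


@dataclass
class PrinterRow:
    """Hypothetical stand-in for a `printers` row (a plain field here, a DB column in Bambuddy)."""
    state: str
    awaiting_plate_clear: bool = False


def on_print_complete(printer: PrinterRow) -> None:
    # A finished or failed print arms the gate.
    printer.awaiting_plate_clear = True


def is_printer_idle(printer: PrinterRow, require_plate_clear: bool) -> bool:
    # Short-circuit: while the ack is pending, the reported state is ignored,
    # so a printer that power-cycled back into IDLE still counts as not idle.
    if require_plate_clear and printer.awaiting_plate_clear:
        return False
    return printer.state in IDLE_STATES


def clear_plate(printer: PrinterRow) -> bool:
    # Accept the ack whenever the flag is set, regardless of the reported state.
    if not printer.awaiting_plate_clear:
        return False
    printer.awaiting_plate_clear = False
    return True
```

For example, suppose a print finishes, the smart plug cuts power, and the printer reboots into `IDLE`. `is_printer_idle(p, True)` stays `False` until `clear_plate(p)` is called.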

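The backup-export fix swaps one call for another. `tempfile.mktemp()` only returns a name. That leaves a window in which another process can create a file or symlink at the path before it is opened. `tempfile.mkstemp()` creates the file exclusively and returns an open OS-level file descriptor together with its path. A minimal sketch of the safe pattern follows. `write_backup_to_temp` is a hypothetical helper, not Bambuddy's endpoint code.

```python
import os
import tempfile


def write_backup_to_temp(data: bytes, suffix: str = ".zip") -> str:
    """Write `data` to a freshly created temp file and return its path (hypothetical helper)."""
    # mkstemp creates and opens the file in one step, so no other process
    # can slip a file or symlink in between choosing the name and opening it.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Wrap the raw descriptor in a file object; closing it closes the fd.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
    except BaseException:
        os.unlink(path)  # don't leave a half-written file behind
        raise
    return path
```

Unlike `TemporaryFile()`, a file created with `mkstemp()` is not deleted automatically. The caller owns it and must remove it, for example after the download response has been sent.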
+ 61 - 0
backend/app/services/print_scheduler.py

@@ -1832,6 +1832,11 @@ class PrintScheduler:
         printer_manager.set_awaiting_plate_clear(item.printer_id, False)
         logger.info("Queue item %s: Status set to 'printing', sending print command...", item.id)
 
+        # Capture state before dispatch so the watchdog can detect whether the
+        # printer actually transitioned (#967).
+        pre_status = printer_manager.get_status(item.printer_id)
+        pre_state = getattr(pre_status, "state", None) if pre_status else None
+
         # Start the print with AMS mapping, plate_id and print options
         started = printer_manager.start_print(
             item.printer_id,
@@ -1849,6 +1854,13 @@ class PrintScheduler:
         if started:
             logger.info("Queue item %s: Print started successfully - %s", item.id, filename)
 
+            # Watchdog: if the printer never transitions out of pre_state, the MQTT
+            # publish was accepted locally but didn't reach the printer (half-broken
+            # session — same shape as #887/#936). Revert the queue item so the next
+            # dispatch can pick it up instead of leaving it stuck in "printing" (#967).
+            if pre_state:
+                asyncio.create_task(self._watchdog_print_start(item.id, item.printer_id, pre_state))
+
             # Get estimated time for notification
             estimated_time = None
             if archive and archive.print_time_seconds:
@@ -1913,6 +1925,55 @@ class PrintScheduler:
 
             await self._power_off_if_needed(db, item)
 
+    @staticmethod
+    async def _watchdog_print_start(
+        queue_item_id: int,
+        printer_id: int,
+        pre_state: str,
+        timeout: float = 45.0,
+        poll_interval: float = 3.0,
+    ) -> None:
+        """Revert a queue item if the printer never acknowledges the start command.
+
+        Bambuddy optimistically marks the queue item as "printing" right after the
+        MQTT project_file publish succeeds locally. If the printer drops/ignores the
+        command (half-broken MQTT session — #887/#936), the state never transitions
+        and the item would otherwise stay stuck in "printing" forever (#967).
+        """
+        deadline = time.monotonic() + timeout
+        while time.monotonic() < deadline:
+            await asyncio.sleep(poll_interval)
+            status = printer_manager.get_status(printer_id)
+            if not status:
+                return  # Printer disconnected — don't mess with the DB
+            if status.state != pre_state:
+                return  # Printer picked up the job
+
+        # No transition. Revert the item so the scheduler can retry.
+        async with async_session() as db:
+            item = await db.get(PrintQueueItem, queue_item_id)
+            if not item or item.status != "printing":
+                return  # Already moved on (completed/cancelled/etc.)
+            item.status = "pending"
+            item.started_at = None
+            await db.commit()
+            logger.warning(
+                "Queue item %s: printer %d did not respond to print command within "
+                "%.0fs (state still %s) — reverted to 'pending' for retry (#967)",
+                queue_item_id,
+                printer_id,
+                timeout,
+                pre_state,
+            )
+
+        # Same half-broken-session recovery as background_dispatch: force the
+        # MQTT client to reconnect so the next dispatch lands without a power cycle.
+        client = printer_manager.get_client(printer_id)
+        if client and hasattr(client, "force_reconnect_stale_session"):
+            client.force_reconnect_stale_session(
+                f"queue print command unacknowledged after {timeout:.0f}s (state still {pre_state})"
+            )
+
 
 # Global scheduler instance
 scheduler = PrintScheduler()
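
The dispatch path starts the watchdog with a bare `asyncio.create_task(...)` and discards the returned task. The asyncio documentation recommends saving a reference to such fire-and-forget tasks. The event loop keeps only weak references, so an unreferenced task may be garbage-collected before it finishes. A minimal sketch of the pattern the docs describe follows. The module-level `_background_tasks` set and `spawn_background` are hypothetical names. In the scheduler, the equivalent would be a set held on the `PrintScheduler` instance.

```python
import asyncio
from typing import Any, Coroutine

# Hypothetical registry that keeps fire-and-forget tasks strongly referenced.
_background_tasks: set = set()


def spawn_background(coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
    """Schedule `coro` and hold a strong reference until it finishes."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow forever.
    task.add_done_callback(_background_tasks.discard)
    return task


async def main() -> tuple:
    results: list = []

    async def job(n: int) -> None:
        await asyncio.sleep(0.01)
        results.append(n)

    spawn_background(job(1))
    spawn_background(job(2))
    await asyncio.sleep(0.05)  # give both jobs time to finish
    return sorted(results), len(_background_tasks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```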