One Command to Patch Every Guest

I have two reliable habits for keeping the homelab patched. When Proxmox flags updates, I open the Updates tab. When a Docker service falls behind, Komodo redeploys it. For months I assumed that covered everything, and then one evening I went looking for the third habit and realised it had never existed.

Most of the homelab is neither Docker nor the hypervisor. It’s a dozen-odd LXC containers, each a full Debian install with its own packages: Pi-hole, Traefik, the internal CA, the git server, the backup server, and a handful more. The Proxmox Updates tab patches the hypervisor those containers run on. It does nothing to the containers themselves. There is no button anywhere that means “update all my guests”, so the honest state of things was that the guests never got updated.

The gap

The way updates actually split up looks like this:

flowchart LR
    subgraph layers["Three things that need patching"]
        direction TB
        host["🖥️ Proxmox hosts\ncorvus + corax"]
        docker["📦 Docker services\non the VM"]
        guests["🐧 14 LXC guests + 1 VM\neach its own Debian"]
    end

    host -->|"PVE Updates tab"| ok1["✅ handled"]
    docker -->|"Komodo redeploy"| ok2["✅ handled"]
    guests -->|"???"| gap["⚠️ nobody"]

The first two were handled. The third was the biggest by host count and the one I’d quietly ignored. “Patch all my guests” means SSHing into each box and running apt update && apt full-upgrade, fifteen times over, which is the kind of repetitive job nobody does by hand for long. So I didn’t.

The obvious tool

This is a textbook job for Ansible, which has the convenient side effect of being the thing I’m trying to get good at right now. The shape is simple: an inventory of every guest, and a play that connects over SSH and runs the upgrade. The apt module does the work, and because it reports whether anything actually changed, I get a clean per-host result instead of fifteen screens of scrolling apt output.

The core of the play is genuinely this small:

- name: Patch OS on all guests
  hosts: guests
  become: true
  serial: 3                       # three at a time, not all at once
  environment:
    DEBIAN_FRONTEND: noninteractive
    NEEDRESTART_MODE: a           # auto-restart services, never prompt
  tasks:
    - name: apt update + full-upgrade + autoremove
      ansible.builtin.apt:
        update_cache: true
        upgrade: full
        autoremove: true

Two choices in there earn their place. serial: 3 patches three hosts at a time instead of the whole fleet at once, so a bad upgrade can break three boxes at most and I notice before the rest run. NEEDRESTART_MODE: a stops the run hanging forever on Debian’s interactive “which services should restart?” prompt, which there is nobody around to answer at four in the morning.

It is also idempotent, which matters more than it sounds. Run it against a host that is already current and Ansible checks, finds nothing to do, and moves on. That is what makes it safe to run whenever I like, and later, safe to run on a timer.

I bolted a summary onto the end so each run finishes with something readable instead of me squinting at the recap:

  pbs              updated
  timemachine      updated
  jellyfin         up-to-date  (reboot required)
  step-ca          up-to-date
  ...
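A minimal sketch of how lines like these could be produced, not necessarily how the real playbook does it. It assumes the upgrade result is registered as apt_result and that a guest signals a pending reboot through the /var/run/reboot-required flag file, which is a common Debian convention but depends on what is installed on the guest. Both register names are made up for illustration.

```yaml
# Task-list fragment: these sit under the play's "tasks:" key.
- name: apt update + full-upgrade + autoremove
  ansible.builtin.apt:
    update_cache: true
    upgrade: full
    autoremove: true
  register: apt_result            # hypothetical name; .changed says whether anything was upgraded

- name: Check whether the guest asks for a reboot
  ansible.builtin.stat:
    path: /var/run/reboot-required   # assumption: the flag file exists on these guests
  register: reboot_flag           # hypothetical name

- name: Print one summary line per host
  ansible.builtin.debug:
    msg: >-
      {{ inventory_hostname }}
      {{ 'updated' if apt_result.changed else 'up-to-date' }}
      {{ '(reboot required)' if reboot_flag.stat.exists else '' }}
```

With serial: 3 the lines arrive batch by batch rather than as one table, so a single tidy block at the end would need an extra collecting step.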

That was the plan. Then I ran it for the first time, and it broke in three different places.

What broke

The first run died partway through downloading, with 400 Bad Request coming back from security.debian.org. A different package failed on each host, every time after a stretch of unusually slow transfer. The mirror was fine; fetching the exact file it choked on by hand returned a clean 200. The reason is that security.debian.org sits behind a Fastly CDN, and apt pipelines its requests: it sends several over one connection without waiting for each reply. Fastly occasionally answers one of those pipelined requests with a 400, and apt treats it as fatal. Turning pipelining off fixes it, so the playbook now writes a small apt config onto each guest before upgrading:

Acquire::Retries "3";
Acquire::http::Pipeline-Depth "0";

“It works when I curl it” and “it works under apt’s connection reuse” turned out to be different claims.
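Writing that config from the playbook could look roughly like the task below. This is a sketch: the file name under /etc/apt/apt.conf.d/ is an assumption, and any name in that directory would be picked up by apt.

```yaml
# Task-list fragment: place it before the upgrade task.
- name: Write apt fetch settings before upgrading
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/99-no-pipelining   # hypothetical file name
    content: |
      Acquire::Retries "3";
      Acquire::http::Pipeline-Depth "0";
    owner: root
    group: root
    mode: "0644"
```

Because copy only reports a change when the file content differs, rerunning it on a guest that already has the file is a no-op.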

The backup server failed differently, with a 401 from enterprise.proxmox.com. Proxmox ships with its enterprise repository switched on, and that repo needs a paid subscription; without one, apt update refuses to go any further. Disabling the enterprise repo and leaning on the free no-subscription one is the standard fix, but it is a per-host bit of setup, and I was glad the playbook stopped and showed me rather than papering over it. The same trap is waiting on the Proxmox hosts themselves, which is part of why I patch those through the web UI and keep them out of this playbook.
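For reference, the standard fix can itself be written as a small play. The sketch below assumes the backup server is Proxmox Backup Server on Debian bookworm with classic one-line .list sources. On releases that use deb822 .sources files, apt_repository will not find the entry and the file would need editing another way. The host name pbs and the debian_release variable are placeholders.

```yaml
- name: Switch the backup server to the no-subscription repo
  hosts: pbs                      # assumed inventory name
  become: true
  vars:
    debian_release: bookworm      # assumption: adjust to the installed release
  tasks:
    - name: Remove the enterprise repository entry
      ansible.builtin.apt_repository:
        repo: "deb https://enterprise.proxmox.com/debian/pbs {{ debian_release }} pbs-enterprise"
        state: absent
        update_cache: false       # leave the cache refresh to the patch play

    - name: Add the no-subscription repository
      ansible.builtin.apt_repository:
        repo: "deb http://download.proxmox.com/debian/pbs {{ debian_release }} pbs-no-subscription"
        state: present
        filename: pbs-no-subscription   # hypothetical file name
        update_cache: false
```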

The bigger surprise was that my containers do not all log in the same way. The older ones still take root over SSH. The newer, hardened ones refuse root and only allow a nick user with sudo, so the inventory carries a per-host override:

timemachine:
  ansible_host: 10.0.1.208
  ansible_user: nick      # root SSH disabled on this one

And the Time Machine box predated any of my conventions entirely: no nick user, and sudo was not even installed. Ansible cannot log into an account that does not exist, so that one needed a quick fix through the Proxmox console before it would join in. The playbook did not cause any of that inconsistency. It just tried to treat fifteen boxes the same way, and found every spot where I had not.
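One way to keep those overrides manageable is a group-wide default that individual hosts replace. The inventory sketch below assumes that layout. Only the timemachine entry comes from the real inventory; the other host name and its address are placeholders.

```yaml
all:
  children:
    guests:
      vars:
        ansible_user: root          # default: older guests still accept root
      hosts:
        older-guest:                # hypothetical name
          ansible_host: 192.0.2.10  # placeholder address
        timemachine:
          ansible_host: 10.0.1.208
          ansible_user: nick        # host-level value wins over the group default
```

Host variables take precedence over group variables in Ansible, so each hardened guest only needs the one extra line.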

One thing that did not become a problem: reboots. An LXC container runs on the host’s kernel, so a kernel update inside one does nothing on its own, and rebooting the container does not change the kernel it runs on. The only guest that genuinely needs a reboot is the single full VM. So the playbook leaves reboots off by default: when I run it by hand it just lists the hosts that asked for one and I deal with them when it suits me. The weekly run is the exception, and I will come back to that.

Making it run itself

A playbook I have to remember to run is not much of a solution. I wanted this to happen on its own, so I hung it off Forgejo Actions, the CI built into my self-hosted git server, on a weekly timer:

on:
  schedule:
    - cron: '0 4 * * 0'          # Sundays, 04:00
  workflow_dispatch: {}          # or trigger it by hand

This is where I walked into something that was obvious in hindsight. I had been running the playbook from my laptop, which already has SSH access to every guest. CI runs somewhere else, on a dedicated runner container with none of that access and no Ansible installed. Scheduling it really meant giving the runner its own way in: a dedicated SSH key, its public half authorised on every guest (with one more small Ansible play), and the private half handed to the runner as an encrypted CI secret.
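The "one more small Ansible play" for the public key might look something like this. It is a sketch that assumes the ansible.posix collection is installed, that ansible_user is set for every guest in the inventory, and that the runner's public key sits at files/ci_runner.pub. That path is invented for the example.

```yaml
- name: Authorise the CI runner's key on every guest
  hosts: guests
  become: true
  tasks:
    - name: Add the runner's public key for the login user
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"    # root or nick, depending on the guest
        key: "{{ lookup('ansible.builtin.file', 'files/ci_runner.pub') }}"   # hypothetical path
        state: present
```

This play runs once from the laptop, which already has access, so the runner never needs a bootstrap path of its own.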

That does hand the runner a key with broad access across the fleet, which on a private network behind Tailscale is a trade I am happy with. The scheduled job also skips the runner itself, so it is not trying to patch and reboot the machine it is running on.

The weekly run is also the one place I let it reboot on its own. Running by hand, I keep reboots off and handle them later, but at four on a Sunday morning nothing is in use, so the scheduled job passes auto_reboot=true and restarts whatever asked for it without me watching. In practice that is just the one VM.
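The opt-in reboot can be expressed as a guarded task. The sketch below keeps the auto_reboot variable named above and assumes a pending reboot is signalled by the /var/run/reboot-required flag file. The register name and the timeout value are illustrative.

```yaml
# Task-list fragment: runs after the upgrade task.
- name: Check for the reboot flag file
  ansible.builtin.stat:
    path: /var/run/reboot-required   # assumption: flag file convention
  register: reboot_flag              # hypothetical name

- name: Reboot only when allowed and requested
  ansible.builtin.reboot:
    reboot_timeout: 600              # illustrative: seconds to wait for the host to return
  when:
    - auto_reboot | default(false) | bool   # off unless passed with -e auto_reboot=true
    - reboot_flag.stat.exists
```

The bool filter matters because a value passed with -e arrives as the string "true", not a boolean.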

One last snag: the bare runner had no locale generated, and Ansible’s Python refuses to start without one. Pinning LC_ALL=C.UTF-8 in the workflow sorted it.
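Put together, the job half of the workflow might look roughly like the sketch below. Much of it is assumption: the runner label, the secret name PATCH_SSH_KEY, the file names inventory.yml and patch.yml, the runner's inventory name, and a Debian-based job image running as root. Host key handling (a known_hosts file) is left out and would be needed before SSH connects cleanly.

```yaml
jobs:
  patch:
    runs-on: docker                   # assumed runner label
    env:
      LC_ALL: C.UTF-8                 # the bare runner has no locale generated
    steps:
      - uses: actions/checkout@v4

      - name: Install Ansible
        run: apt-get update && apt-get install -y ansible openssh-client

      - name: Load the runner's SSH key
        env:
          SSH_KEY: ${{ secrets.PATCH_SSH_KEY }}   # hypothetical secret name
        run: |
          install -m 700 -d ~/.ssh
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519

      - name: Run the playbook
        # 'runner' is a placeholder for the runner's own inventory name
        run: >-
          ansible-playbook -i inventory.yml patch.yml
          --limit 'guests:!runner'
          -e auto_reboot=true
```

Passing the secret through an environment variable rather than interpolating it into the script keeps it out of the rendered command line.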

What it’s like now

It runs every Sunday at four in the morning, works through the guests in threes, reboots the one VM that wants it, and leaves a summary in the CI log. I find out it happened by not hearing about it.

Patching is the least interesting thing I do to this setup. Nothing new exists at the end of it and there is nothing to show anyone. But it is the part that quietly keeps everything alive, and moving it from “a job I put off” to “a thing that runs whether I think about it or not” was worth more than most of the services I have added. The whole point of a homelab is to learn this by getting it wrong somewhere it does not matter, and a forgotten apt upgrade across fifteen Debian boxes turned out to be a good teacher. Most of what it taught me was about my own past shortcuts, surfaced the moment one command tried to treat every box the same.
