February 2024 – blog.germancoding.com

Recently, it happened to me as well: I ran out of disk space on a “production” system, and all hell broke loose. So here’s the short postmortem:

The trigger was me playing around with server-side LDAP settings. Ironically those were intended to make stuff more stable and prevent outages. The new config was enabled, I verified that the LDAP clients could still logon and everything.

The next day, everything seemed to be fine. That was, until some scripts on one of my machines started behaving a bit erratically. It logged a few unusual errors (such as being unable to write to a file), but everything else seemed normal – no service was down. Eventually stuff started crashing at about the same time when I started investigating the unusual errors.

Analysis quickly pointed to disk trouble: A failing disk perhaps? No, #df -h revealed the problem: /dev/mmcblk0p3 55G 55G 0 100% /. The 64 GB eMMC on my Odroid had completly filled up to 100%. The normal disk usage on that eMMC is at around 20 GiB. So who had eaten around 35 GiB of disk space? Further checks with
#du -hs /var/lib/docker/containers appointed the blame to Docker: One of the containers was over 30 GB in size!

How could that happen? Well, remember the LDAP changes? On the machine, the affected container contained a login routine that was trying to login via LDAP. This had ran into a hiccup where the routine failed to login¹, logged an error, then tried again. It was doing this on an infinite loop – there was no backoff or retry count/timeout built into it.

The sheer amount of error lines emitted by the routine caused the docker container’s log file (which is in json format by default) to grow huge. And now to the real surprise: The Docker default json logging driver does not have any sort of rotation or size limit build into it! It just fills your disk until eternity. The docker configuration manual has a notice that warns about this behaviour, but honestly who digs so deep in that manual? It’s not like it’s a top-10 page or something. Why the heck is this the default? “For backwards compatibility reasons” – well, then why don’t we change it for new setups only? I think this is a really stupid design decision, but yeah.

So, lessons learned:

On all new docker installs, change /etc/docker/daemon.json to use the local log driver, or json with a size limit configured.
Monitor your disk usage at all times, with alerting if stuff becomes critical.
Implement backoffs for operations that should be retried. Don’t hammer infinite loops when stuff doesn’t work.²