NixOS impermanence with F2FS


Last updated 2026-05-08 16:36

The futility of configuration management

The problem: Configuration drift

Configuration drift occurs when a system's actual state diverges from its desired or intended state. Generally, this is due to an accumulation of small changes over time -- an ad-hoc fix applied during an emergency, a package setting a configuration option that's unexpected or unaccounted for, a change applied to one instance of a distributed system but not the others. It's the buildup of cruft and detritus that occurs over the lifespan of a running system, and it can become a problem when it leads to unexpected behavior.

The paper Why Order Matters has a useful heuristic for breaking down the problem of configuration drift in live systems. It focuses on defining the desired configuration in terms of disk state rather than system behavior:

The behavioral attributes of a complex host seem to be effectively infinite over all possible inputs, and therefore difficult to fully quantify (§A.9). The disk size is finite, so we can completely describe hosts in terms of disk content, but we cannot completely describe hosts in terms of behavior. We can easily test all disk content, but we do not seem to be able to test all possible behavior.

This paper was written in 2002, so the focus on disk state per se feels slightly dated. For example, it's normal in the high-performance computing (HPC) world to have clusters with hundreds or thousands of individual nodes, each of which boots its OS entirely over the network, completely bypassing the need for a local disk. But, obviously, there's still a filesystem, so I think it's fair to extrapolate the paper's focus on "disk state" to one of filesystem state, with the understanding that the filesystem in question might be entirely resident in memory, located on a network, or some other weird thing.

In Why Order Matters's reckoning, a given system can be either divergent, convergent, or congruent.

  • Divergent systems have a filesystem state that, over time, creeps further and further away from the desired state. This is your classic "server in a closet running since 2016 that we fix when it breaks and otherwise leave alone".
  • Convergent systems have a filesystem state that starts divergent, but over time, is brought closer to alignment with the desired state. Some brave soul has started putting the server-in-a-closet's configuration into Ansible, adding new parts to the playbook when they find something on it that isn't as it should be.
  • Congruent systems have a filesystem state that is kept in lockstep with the desired state. It starts out the way you want, and it stays the way you want.

Clearly, from a sysadmin's point of view, congruency is the ideal. The question is how to achieve it.

Things that aren't quite solutions

One way is simply to reduce the effective lifespan of the system -- sidestepping the problem altogether and denying the cruft a chance to accumulate by limiting the time that a system is running. If you're using some sort of auto-scaled VM that's created on demand from a known-good image and destroyed when it's no longer needed, there generally isn't enough time for the system configuration to diverge from that initial known-good state. In the HPC world, the aforementioned stateless network provisioning achieves the same effect.

Another is to have the running system under some sort of configuration management. This is the ultimate aim of tools such as Ansible and Puppet. However, the limitations of those tools make using them to achieve congruence essentially impossible. I'll use Puppet/OpenVox as my example here, since that's what I'm most familiar with from my day job. To the best of my understanding, my point broadly applies to Ansible, Salt, Chef/cinc, etc.

Puppet is a configuration management system. An agent runs on client machines and pulls that node's specific configuration from a central server. System configurations are written in a domain-specific language that lets you define the intended state of the system. It works on whichever flavor of Linux you prefer, several of the BSDs, and even hobby OSs like MacOS and Windows. You say the system needs Firefox version 138 installed? It'll try to figure out the right package manager and install Firefox for you. Change your mind and want to make sure Firefox isn't installed? Tell Puppet to make sure Firefox is absent, and it will remove it. /var/your_file.txt needs to be owned by root and contain the text "hello I am a file"? It'll put it there. Want the sshd service running? It'll figure out what your service manager is and make sure it's running.

The fact that it works across so many different systems, each with different abstractions, is ultimately what prevents it from achieving congruence. By design, Puppet will only manage things you tell it to manage. If you install a program on a Puppet-controlled system without using Puppet to do so, it cannot and will not stop you. If you have Puppet managing the Firefox package in some way, be it present or absent, and then remove Firefox from the node's Puppet configuration, Puppet won't undo anything it's done. It won't uninstall Firefox, and it won't return it to some default configuration -- it just stops managing Firefox, leaving it as it was the last time the agent ran. Out of necessity, Puppet has to respect any non-managed state on the system, because it can't, on its own, manage the totality of filesystem state for all the various operating systems it supports. Therefore, it can never ensure that your system is completely congruent. There will always be the possibility that something else, outside the configuration management system, changes some operationally-relevant part of the filesystem.

A true congruent configuration management system must control the entire state of the system, not just the parts you explicitly tell it to manage. Another way to phrase that is, on a congruent system, there is no such thing as un-managed state.

Nix, NixOS, and congruent configuration management

NixOS grew out of the Nix project. Nix was developed as a package manager that was simultaneously more specific, more correct, and more flexible than existing systems like RPM and dpkg. I'm going to hand-wave over the details here (read the white paper for as much detail as you can handle) but there are a few key parts I want to highlight:

  • Nix packages are defined using a domain-specific language that describes the inputs and build process (a minimal sketch follows this list).
  • Packages and their dependencies are referenced not by package name and version number, but by a hash that uniquely identifies the specific iteration of that package. Dependencies, build options, optimizations, patches -- changing any one aspect of the build produces a package with a different hash.
  • Built packages are stored in the Nix store, usually at /nix/store.
  • To ensure package consistency, the Nix store is read-only for all users. Everything that wants to add or remove things from the Nix store must go through the Nix daemon.
  • To expose a particular set of packages to a user, they are grouped into profiles, which at base are collections of symlinks to the files that the various desired packages provide -- so, for example, on my personal laptop, where I use Nix to manage some apps:
sabo@alluminio:~$ which sway
/home/sabo/.nix-profile/bin/sway
sabo@alluminio:~$ readlink `!!`
readlink `which sway`
/nix/store/5gir9sizyasdcm5y54r4m950yhjygdi4-swayfx-0.5.3/bin/sway
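
To make that first bullet point concrete, here's a minimal, entirely hypothetical package definition -- mkDerivation is real Nix, but the package name, URL, and hash are made-up placeholders:

# Hypothetical package definition, for illustration only.
# Every input here -- the source hash, the dependencies, the stdenv
# itself -- feeds into the output hash, so changing any one of them
# produces a different path in /nix/store.
{ stdenv, fetchurl, zlib }:

stdenv.mkDerivation {
  pname = "hello-example";  # made-up package name
  version = "1.0";

  src = fetchurl {
    url = "https://example.org/hello-example-1.0.tar.gz";  # placeholder URL
    sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder hash
  };

  buildInputs = [ zlib ];  # a dependency, itself identified by a store hash
}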

NixOS expands this concept to the operating system as a whole by adding the ability to manage the configuration of a system in the same way that Nix manages packages. Specifically, you no longer edit config files directly on disk. You use the Nix language to manage the files or, when supported, set the options directly in your configuration.nix file. A command called nixos-rebuild will determine the changes required, add the resulting files to the Nix store, and switch around the symlinks to make the filesystem state match the desired configuration.
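
For example, here's a minimal sketch of managing sshd declaratively -- I'm assuming current option names here (services.openssh.settings is how recent NixOS releases expose sshd_config options):

# Sketch: configuring sshd through configuration.nix instead of editing
# /etc/ssh/sshd_config by hand. nixos-rebuild generates the config file,
# stores it in /nix/store, and symlinks it into /etc.
{ config, pkgs, ... }:

{
  services.openssh = {
    enable = true;
    settings.PasswordAuthentication = false;
  };
}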

To demonstrate, here's how /etc/ssh/sshd_config is managed on a NixOS machine:

[sabo@sundryvm:~]$ ls -lah /etc/ssh/sshd_config
lrwxrwxrwx 1 root root 27 May  3 09:50 /etc/ssh/sshd_config -> /etc/static/ssh/sshd_config

[sabo@sundryvm:~]$ ls -lah /etc/static/ssh/sshd_config
lrwxrwxrwx 1 root root 59 Dec 31  1969 /etc/static/ssh/sshd_config -> /nix/store/j0banccqz5wnzwsikl95y0gmmcw8jx98-sshd.conf-final

Every aspect of the OS, from the bootloader to the desktop, can be managed by NixOS -- and probably should be, since NixOS rejects more than a few assumptions shared by most other Linux operating systems, such as the Filesystem Hierarchy Standard.

Nix, the package manager, can run on any flavor of Linux (and MacOS) you like. But NixOS is an all-or-nothing deal -- you can't do "just a little NixOS" on another Linux distro, the way you can have only a few things on a system be configured with Ansible or Puppet.

NixOS, with the ability to control all filesystem state in the system, is much closer to supporting truly congruent configuration management. It gives us the tools required to manage all aspects of a working system. The missing piece is ensuring that there's no state outside of that management. This is where impermanence, also known as "ephemeral root" and a dozen other synonyms, comes in.

True congruence through impermanence

Impermanence takes advantage of the fact that NixOS keeps all managed state in the Nix store and reconstructs the required symlinks on boot. To get a running system, all that's needed is a persistent /boot (so the bootloader can load the kernel, initrd, etc.) and a persistent /nix. The entire rest of the filesystem can be blank, as long as those two directories are populated. The various impermanence methods do just that, using assorted tricks and filesystem features to ensure that everything outside those two directories is wiped on every system start.
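
To illustrate, here's a minimal sketch of the tmpfs-as-root variant (one of the usual options discussed below); the device labels are placeholders, not from my actual setup:

# Sketch of a tmpfs-root impermanence layout. Only /boot and /nix live
# on real disk; everything else evaporates on shutdown.
fileSystems."/" = {
  device = "none";
  fsType = "tmpfs";
  options = [ "defaults" "size=2G" "mode=755" ];
};

fileSystems."/boot" = {
  device = "/dev/disk/by-label/BOOT";  # placeholder
  fsType = "vfat";
};

fileSystems."/nix" = {
  device = "/dev/disk/by-label/nix";   # placeholder
  fsType = "ext4";
  neededForBoot = true;
};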

The use case

For my homelab setup, I originally went with Ansible as a configuration management system for a handful of VMs running various services. I quickly found out that I really don't like Ansible, for reasons that are outside the scope of this post. So, now I'm investigating a NixOS-based setup, and decided to jump in with both feet and aim for a completely congruent configuration by building an impermanent system.

The usual impermanence options

Most impermanent NixOS setups seem to use tmpfs, btrfs, or ZFS as their root filesystem. For such a system, we need to wipe the slate clean (that is, remove any and all files on the root filesystem) on boot. Systems with tmpfs as root get this automatically, because data in RAM is lost on every reboot. ZFS- and btrfs-based setups solve this by restoring a snapshot of the initial, empty state of the filesystem on every boot.

For this use case, I didn't really like any of those options:

  • tmpfs: This would run the risk of eating into the VMs' limited memory as files are written to root. I'd rather have that memory doing useful things.
  • btrfs: I have a chip on my shoulder about btrfs. Too many times, a system suddenly ran out of space and required hellacious contortions to get it back on its feet, when that was possible at all.
  • ZFS: I love ZFS. I use it everywhere I can. I don't want to use ZFS here for a few reasons.
    • DKMS. ZFS is out-of-kernel and uses DKMS to build the required kernel modules on every kernel update. NixOS (and pretty much every other distro that ships ZFS) handles this well, but I wanted an in-kernel filesystem to avoid spending time building modules on a small VM with every kernel update.
    • ZFS's data safety benefits aren't really applicable here. For one, on at least the root FS, not keeping data is kind of the point. For another, I'm using Proxmox with local ZFS storage, and the VM disks themselves are ZVOLs on ZFS. Through that, I'm already getting data checksums and redundancy for the disk state that I do want to keep. Why put a hat on a hat?
    • The other nice features ZFS offers aren't useful to me here.
  • honorable mention to bcachefs, the Waluigi of Linux filesystems. Like btrfs and ZFS, it's a copy-on-write filesystem with snapshot support. I suppose it could work here. I haven't played with it at all. But it's not as mature as ZFS, and (as of recently) no longer in-kernel, so there's nothing to gain over ZFS here.

The wildcard options

There are two other in-kernel filesystems that support some form of snapshotting -- nilfs2 and f2fs. These are log-structured filesystems designed for use in flash devices. From "The design and implementation of a log-structured file system", via the kernel docs for f2fs:

A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently.

I'm not sure how widely nilfs2 is used, but f2fs has apparently been the default in Android since 2016. I was drawn to these filesystems for two reasons: they're in-kernel, and they both have some degree of snapshotting support. From my research, there don't seem to be many (or any) people using either of these filesystems for this purpose.

I first experimented with nilfs2. NILFS2 automatically creates "checkpoints" of the state of the filesystem on every write. These checkpoints are garbage-collected at regular intervals, but the user can mark certain checkpoints as snapshots, which protects them from garbage collection. Checkpoints and snapshots can be mounted read-only, allowing you to view the filesystem as it existed at that point in time. As far as I can tell, however, there's no "rollback" functionality, nor is there a way to mount a snapshot read-write -- so there's no way to actually restore the filesystem to a blank initial state. So much for nilfs2.

f2fs has a much simpler checkpoint scheme. There's no arbitrary snapshotting that a user can control. A checkpoint in f2fs represents a known-good state of the filesystem's internal data structures. When an f2fs filesystem starts up, it finds the most recent checkpoint and goes forward from there. In fact, f2fs only ever has at most two checkpoints defined: the most recent one, and the one before it. On the surface, this seemed inflexible to the point of uselessness for this project. However, with the mount flag checkpoint=disable, we can prevent f2fs from creating new checkpoints. From the f2fs docs:

While disabled, any unmounting or unexpected shutdowns will cause the filesystem contents to appear as they did when the filesystem was mounted with that option.

So, if we create a new f2fs filesystem, and only ever mount it with checkpointing disabled, we get our amnesiac root file system.

Implementation

NixOS installation

I used the disko tool to format the drive. When creating the root filesystem, I made sure to set the extra_attr and compression options -- not to save space, since, well, I'm not worried about that if everything gets wiped on every boot, but to reduce the number of writes to the VM drive.

rootstore = {
  size = "5G";
  content = {
    type = "filesystem";
    format = "f2fs";
    extraArgs = [
      "-O"
      "extra_attr,compression"
    ];
    mountOptions = [
      "checkpoint=disable"
      "compress_algorithm=zstd"
      "lazytime"
    ];
    mountpoint = "/";
  };
};

Once disko had done its magic, I proceeded with the installation of NixOS, following the installation instructions. The only additional change I had to make was to add the all-important mount parameters to configuration.nix, since for some reason the mountOptions were not carried over into the hardware-configuration.nix file you are told not to edit.

fileSystems."/" =
  { device = "/dev/mapper/pool-rootstore";
    fsType = "f2fs";
    options = [ "checkpoint=disable" "compress_algorithm=zstd" "lazytime" ];
  };

And that's...pretty much it. It works as expected. Root is empty on every boot.

When things break

After the VM was set up, to understand possible failure modes, I set about trying to break it by doing things you shouldn't do: filling up the root filesystem, writing a bunch of random data around, stopping the VM without cleanly shutting it down, and so on.

One of the criticisms of f2fs I've seen is that it's fairly easy to corrupt if you, say, cut power to the system, and that fsck.f2fs isn't very good at correcting errors. Now, clearly, we don't care about data corruption if there's no data to corrupt on the filesystem. However, in trying to break this VM in a different way, I semi-accidentally corrupted something or other on the root FS, probably by stopping the VM without shutting down the OS. The system refused to mount it, since fsck wasn't returning clean. The "fix" here is just to log in to a rescue shell by adding the boot parameters rescue systemd.setenv=SYSTEMD_SULOGIN_FORCE=1 and reformatting the root partition.

The other scenario where a reformat is needed is if you somehow mount the root filesystem without that checkpoint=disable option, as I did when I accidentally removed that fileSystems."/" override from my configuration.nix. Without that option, f2fs will happily do the right-in-every-circumstance-except-this-one thing and save a checkpoint for all the data you write to the disk during that boot. There's no way to roll back to an earlier checkpoint with f2fs, so after that, the only way to restore the disk to a blank state is to format it again.

I think it should be possible to add some systemd magic to the boot process to cover both of those cases, like this person does with his ZFS-based ephemeral setup: essentially, add a unit file that reformats the root filesystem whenever systemd-fsck-root.service fails, or whenever a canary file is detected on /. I haven't had a chance to play around with that idea yet.
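
For what it's worth, here's a rough, untested sketch of what the canary-file half of that might look like -- I'm assuming boot.initrd.systemd.services accepts the same options as regular systemd.services, and the device path and canary filename are placeholders:

# Untested sketch. Runs in the initrd, before the real root is mounted:
# if a canary file from a previous checkpoint-enabled boot survived,
# the root filesystem kept state it shouldn't have, so reformat it.
boot.initrd.systemd.extraBin."mkfs.f2fs" = "${pkgs.f2fs-tools}/bin/mkfs.f2fs";

boot.initrd.systemd.services.wipe-stale-root = {
  description = "Reformat root if stale state survived a reboot";
  wantedBy = [ "initrd-root-fs.target" ];
  before = [ "sysroot.mount" ];
  unitConfig.DefaultDependencies = false;
  serviceConfig.Type = "oneshot";
  # Mount root read-only, look for a canary file that should never
  # survive a reboot, and reformat if it's present. The fsck-failure
  # case would hang an OnFailure= off systemd-fsck-root.service instead.
  script = ''
    mkdir -p /tmp/rootcheck
    mount -o ro /dev/mapper/pool-rootstore /tmp/rootcheck
    if [ -e /tmp/rootcheck/canary ]; then
      umount /tmp/rootcheck
      mkfs.f2fs -f -O extra_attr,compression /dev/mapper/pool-rootstore
    else
      umount /tmp/rootcheck
    fi
  '';
};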

Conclusion

This is entirely just for fun on my personal systems with my personal data. I think it'd be real silly for someone to do this in any kind of production environment.