<!--
.. title: NixOS impermanence with F2FS
.. slug: nixos-impermanence-with-f2fs
.. date: 2026-04-30 06:43:28 UTC-04:00
.. tags: nix, nixos, f2fs, configuration management
.. category: 
.. link: 
.. description: A rant on configuration management and a few notes on a fun experiment I'm running in my homelab VM setup.
.. type: text
-->


## The futility of configuration management

### The problem: Configuration drift

*Configuration drift* is when a divergence occurs between a system's desired
or intended state and its actual state. Generally, this is due to an accumulation
of small changes over time -- like an ad-hoc fix to solve an emergency, or a
package setting a configuration option that's unexpected or unaccounted for, or
a change being applied to one instance of a distributed system and not the
others. It's the buildup of cruft and detritus that occurs over the lifespan of
a running system, and it can become a problem when it leads to unexpected
behavior.

The paper *[Why Order Matters](http://www.infrastructures.org/papers/turing/turing.html)*
has a useful heuristic for breaking down the problem of configuration drift in
live systems. It focuses on defining the desired configuration in terms of disk
state rather than system behavior:

> The behavioral attributes of a complex host seem to be effectively infinite
> over all possible inputs, and therefore difficult to fully quantify (§A.9).
> The disk size is finite, so we can completely describe hosts in terms of disk
> content, but we cannot completely describe hosts in terms of behavior. We can
> easily test all disk content, but we do not seem to be able to test all possible behavior.

This paper was written in 2002, so the focus on disk state per se feels slightly
dated. For example, it's normal in the high-performance computing (HPC) world to
have clusters with hundreds or thousands of individual nodes, each of which boots
its OS entirely over the network, completely bypassing the need for a local
disk. But, obviously, there's still a filesystem, so I think it's fair to
extrapolate the paper's focus on "disk state" to one of *filesystem state*,
with the understanding that the filesystem in question might be entirely
resident in memory, located on a network, or some other weird thing.

In *Why Order Matters*'s reckoning, a given system can be either *divergent*,
*convergent*, or *congruent*.

* **Divergent** systems have a filesystem state that, over time, creeps further
  and further away from the desired state. This is your classic "server in a
  closet running since 2016 that we fix when it breaks and otherwise leave
  alone".
* **Convergent** systems have a filesystem state that starts divergent, but over
  time, is brought closer to alignment with the desired state. Some brave soul
  has started putting the server-in-a-closet's configuration into Ansible,
  adding new parts to the playbook when they find something on it that isn't as
  it should be.
* **Congruent** systems have a filesystem state that is kept in lockstep with
  the desired state. It starts out the way you want, and it stays the way you
  want.

Clearly, from a sysadmin's point of view, congruency is the ideal. The
question is how to achieve it.

### Things that aren't quite solutions

One way is just to reduce the effective lifespan of the system -- sidestepping
the problem altogether and denying the cruft a chance to accumulate by limiting
the time that a system is running. If you're using some sort of auto-scaled VM
that's created on-demand from a known-good image and destroyed when it's no
longer needed, there generally isn't enough time for your system configuration
to diverge from that initial known-good state. Or, in the HPC world, the
aforementioned stateless network provisioning way of doing things also
effectively achieves this.

Another is to have the running system under some sort of configuration
management. This is what tools such as Ansible and Puppet have as their ultimate
aim. However, the limitations of those tools make using them to achieve
congruence essentially impossible. I'll use [Puppet](https://puppet.com)/
[OpenVox](https://voxpupuli.org/openvox/) as my example here, since that's what
I'm most familiar with due to my day job. To the best of my understanding, my
point broadly applies to Ansible, Salt, Chef/cinc, etc.

Puppet is a configuration management system. An agent runs on client machines
and pulls that node's specific configuration from a central server. System
configurations are written in a domain-specific language that lets you define
the intended state of the system. It works on whichever flavor of Linux you
prefer, several of the BSDs, and even hobby OSs like MacOS and Windows. You say
the system needs Firefox version 138 installed? It'll try to figure out the
right package manager and install Firefox for you. Change your mind and want to
make sure Firefox isn't installed? Tell Puppet to make sure Firefox is absent,
and it will remove it. `/var/your_file.txt` needs to be owned by root and
contain the text "hello I am a file"? It'll put it there. Want the `sshd`
service running? It'll figure out what your service manager is and make sure
it's running.
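To make that declarative style concrete, the examples above might be written something like this in Puppet's DSL (a sketch -- the resource names and file contents are just the ones from the paragraph above):

```puppet
# Ensure a specific Firefox version is installed
# (change ensure to 'absent' to have Puppet remove it instead).
package { 'firefox':
  ensure => '138',
}

# A file with fixed ownership and contents.
file { '/var/your_file.txt':
  ensure  => file,
  owner   => 'root',
  content => 'hello I am a file',
}

# Keep sshd running, whatever the local service manager turns out to be.
service { 'sshd':
  ensure => running,
  enable => true,
}
```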

The fact that it works across so many different systems, each with different
abstractions, is ultimately what prevents it from achieving congruence. By
design, Puppet will only manage things you tell it to manage. If you install a
program on a Puppet-controlled system without using Puppet to do so, it cannot
and will not stop you. If you have Puppet managing the Firefox package in some
way, be it present or absent, and then remove your configuration of Firefox from
the system's configuration, Puppet won't undo anything it's done. It won't
uninstall Firefox, it won't return it to some default configuration -- it just
stops managing the configuration of Firefox, leaving it as it was the last time
the agent ran. Out of necessity, it has to respect any non-managed state on the
system, because it can't, on its own, manage the totality of filesystem state
for all the various operating systems it supports. Therefore, it can never
*ensure* that your system is completely congruent. There will always be the
possibility that something else, outside of the configuration management system,
changes some operationally-relevant part of the filesystem.

A true congruent configuration management system must control the **entire**
state of the system, not just the parts you explicitly tell it to manage.
Another way to phrase that is, on a congruent system, there is no such thing as
un-managed state.

### Nix, NixOS, and congruent configuration management

[NixOS](https://nixos.org/) grew out of the Nix project. Nix was developed as a
package manager that was simultaneously more specific, more correct, and more
flexible than existing systems like RPM and dpkg.
I'm going to hand-wave over the details here (read
[the white paper](https://edolstra.github.io/pubs/nspfssd-lisa2004-final.pdf)
for as much detail as you can handle) but there are a few key parts I want to
highlight:

* Nix packages are defined using a domain-specific language that describes the
  inputs and build process.
* Packages and their dependencies are referenced not by package name and version
  number, but by a hash that uniquely identifies
  the specific iteration of that package. Dependencies, build options,
  optimizations, patches -- changing any one aspect of the build produces a
  package with a different hash.
* Built packages are stored in the *Nix store*, usually at `/nix/store`.
* To ensure package consistency, the Nix store is read-only for all users.
  Everything that wants to add or remove things from the Nix store must go
  through the Nix daemon.
* To expose a particular set of packages to a user, they are grouped into
  *profiles*, which at base are collections of symlinks to the files that the
  various desired packages provide -- so, for example, on my personal laptop,
  where I use Nix to manage some apps:

```shell
sabo@alluminio:~$ which sway
/home/sabo/.nix-profile/bin/sway
sabo@alluminio:~$ readlink `!!`
readlink `which sway`
/nix/store/5gir9sizyasdcm5y54r4m950yhjygdi4-swayfx-0.5.3/bin/sway
```

NixOS expands this concept to the operating system as a whole by adding the
ability to manage the configuration of a system in the same way that Nix manages
packages. Specifically, you no longer edit config files directly on disk. You
use the Nix language to manage the files or, when supported, set the options
directly in your `configuration.nix` file. A command called `nixos-rebuild` will
determine the changes required, add the resulting files to the Nix store, and
switch around the symlinks to make the filesystem state match the desired
configuration.
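As a small example of the "set the options directly" style, here's roughly what an sshd stanza in `configuration.nix` can look like, using options from the NixOS `services.openssh` module:

```nix
# configuration.nix (fragment): declare the intended sshd state;
# `nixos-rebuild switch` generates the config file in the Nix store
# and points the /etc symlinks at it.
services.openssh = {
  enable = true;
  settings = {
    PasswordAuthentication = false;
    PermitRootLogin = "no";
  };
};
```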

To demonstrate, here's how `/etc/ssh/sshd_config` is managed on a NixOS machine:

```shell
[sabo@sundryvm:~]$ ls -lah /etc/ssh/sshd_config
lrwxrwxrwx 1 root root 27 May  3 09:50 /etc/ssh/sshd_config -> /etc/static/ssh/sshd_config

[sabo@sundryvm:~]$ ls -lah /etc/static/ssh/sshd_config
lrwxrwxrwx 1 root root 59 Dec 31  1969 /etc/static/ssh/sshd_config -> /nix/store/j0banccqz5wnzwsikl95y0gmmcw8jx98-sshd.conf-final
```

Every aspect of the OS from the bootloader to the desktop can be managed by
NixOS -- and probably should be, since NixOS rejects more than a few assumptions
shared by most other Linux operating systems, such as the
[Filesystem Hierarchy Standard](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html).

Nix, the package manager, can run on any flavor of Linux (and MacOS) you like.
But NixOS is an all-or-nothing deal -- you can't do "just a little NixOS" on
another Linux distro, the way you *can* have only a few things on a system be
configured with Ansible or Puppet.

NixOS, with the ability to control all filesystem state in the system, is much
closer to supporting truly congruent configuration management. It gives us the
tools required to manage *all* aspects of a working system. The missing piece is
ensuring that there's no state *outside* of that management. This is where
impermanence, also known as "ephemeral root" and a dozen other synonyms, comes
in.

## True congruence through impermanence

Impermanence takes advantage of the fact that NixOS keeps all managed state in
the Nix store, and reconstructs the required symlinks on boot. To get a running
system, all that's needed is a persistent `/boot` (so the bootloader can load
the kernel, initrd, etc) and a persistent `/nix`. The entire rest of the
filesystem can be blank, as long as those two directories are populated. The
various impermanence methods do just that, using various tricks and filesystem
features to make sure that the filesystem state (besides those two directories)
is empty on every system start.
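One common shape of this -- a tmpfs root with persistent `/boot` and `/nix` -- can be expressed directly in `fileSystems`. This is just a sketch; the labels, sizes, and filesystem types here are made up:

```nix
# Sketch of a tmpfs-root impermanent layout. Everything on / evaporates
# at shutdown; only /boot and /nix are backed by real disks.
fileSystems."/" = {
  device = "none";
  fsType = "tmpfs";
  options = [ "defaults" "size=2G" "mode=755" ];
};
fileSystems."/boot" = {
  device = "/dev/disk/by-label/boot";
  fsType = "vfat";
};
fileSystems."/nix" = {
  device = "/dev/disk/by-label/nix";
  fsType = "ext4";
  neededForBoot = true;
};
```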

### The use case

For my homelab setup, I originally went with Ansible as a configuration
management system for a handful of VMs running various services. I quickly found
out that I really don't like Ansible, for reasons that are outside the scope of
this post. So, now I'm investigating a NixOS-based setup, and decided to jump in
with both feet and aim for a completely congruent configuration by building an
impermanent system.

### The usual impermanence options

Most impermanent NixOS setups seem to use either tmpfs, btrfs, or ZFS as their
root filesystem. For such a system, we need to wipe the slate clean (that is,
remove any and all files on the root filesystem) on boot. Systems with tmpfs as
root get this automatically, because data in RAM gets wiped on every boot. ZFS
and btrfs-based solutions solve this by restoring a snapshot of the initial,
empty state of the filesystem on every boot.

For this use case, I didn't really like any of those options:

* **tmpfs**: This would run the risk of eating into the VMs' limited memory as
  files are written to root. I'd rather have that memory doing useful things.
* **btrfs**: I have a chip on my shoulder about btrfs. I've had too many
  incidents where a system suddenly ran out of space and required hellacious
  contortions to get it back on its feet, when I could get it back at all.
* **ZFS**: I love ZFS. I use it everywhere I can. I don't want to use ZFS here
  for a few reasons.
    * dkms. ZFS is out-of-tree, and on most distros uses dkms to rebuild the
      required kernel modules on every kernel update. NixOS (and pretty much
      every other distro that ships ZFS) handles this well, but I wanted an
      in-kernel filesystem to avoid spending time building modules on a small
      VM on every kernel update.
    * ZFS's data safety benefits aren't really applicable here. For one, on at
      least the root FS, not keeping data is kind of the point. For another, I'm
      using Proxmox with local ZFS storage, and the VM disks themselves are ZVOLs
      on ZFS. Through that, I'm already getting data checksums and redundancy for
      the disk state that I *do* want to keep. Why put a hat on a hat?
    * The other nice features ZFS offers aren't useful to me here.
* Honorable mention to **bcachefs**, the Waluigi of Linux filesystems. Like
  btrfs and ZFS, it's a copy-on-write filesystem with snapshot support, so I
  suppose it could work here. I haven't played with it at all, though, and it's
  not as mature as ZFS and (as of recently) no longer in-kernel, so there's
  nothing to gain over ZFS here.

### The wildcard options

There are two other in-kernel filesystems that support some form of snapshotting
-- nilfs2 and f2fs. These are log-structured filesystems designed for use in
flash devices. From
["The design and implementation of a log-structured file system"](https://dl.acm.org/doi/10.1145/146941.146943),
via [the kernel docs for f2fs](https://docs.kernel.org/filesystems/f2fs.html):

> A log-structured file system writes all modifications to disk sequentially in
> a log-like structure, thereby speeding up both file writing and crash
> recovery. The log is the only structure on disk; it contains indexing
> information so that files can be read back from the log efficiently.


I'm not sure how widely nilfs2 is used, but f2fs has apparently been the
default in Android since 2016. I was drawn to these filesystems for two reasons:
they're in-kernel, and they both have some degree of snapshotting support. From
my research, there don't seem to be many (or any) people using either of these
filesystems for this purpose.

I first experimented with nilfs2, which automatically creates "checkpoints" of
the state of the file system on every write. These checkpoints are
garbage-collected at regular intervals. The user can select certain checkpoints
as snapshots, which saves them from the garbage collection process. Checkpoints
and snapshots can be mounted read-only, allowing you to view the filesystem as
it existed at that point in time. As far as I can tell, however, there's no
"rollback" functionality, nor is there a way to mount a snapshot read-write --
so there's no way to actually restore a filesystem to a blank initial state. So
much for nilfs2.
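For reference, the nilfs2 checkpoint workflow looks roughly like this, using the standard nilfs-utils tools (the device path and checkpoint number here are made up, and -- the crucial limitation -- none of this rolls anything back):

```shell
# List the checkpoints on a nilfs2 volume
lscp /dev/vdb1

# Promote checkpoint 42 to a snapshot, exempting it from garbage collection
chcp ss /dev/vdb1 42

# Mount that snapshot read-only to inspect old state. There's no read-write
# mount of a snapshot and no rollback, which is what rules nilfs2 out here.
mount -t nilfs2 -o ro,cp=42 /dev/vdb1 /mnt/snap
```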

f2fs has a much simpler checkpoint scheme. There's no arbitrary snapshotting
that a user can control. A checkpoint in f2fs represents a known-good state of
the filesystem's internal data structures. When an f2fs filesystem starts up, it
finds the most recent checkpoint and goes forward from there. In fact, f2fs
only ever has at most two checkpoints defined: the most recent one, and the one
before it. On the surface, this seemed inflexible to the point of uselessness
for this project. However, with the mount flag `checkpoint=disable`, we can
prevent f2fs from creating a new checkpoint. From the [f2fs docs](https://docs.kernel.org/filesystems/f2fs.html#mount-options):

> While disabled, any unmounting or unexpected shutdowns will cause the
> filesystem contents to appear as they did when the filesystem was mounted
> with that option.

So, if we create a new f2fs filesystem, and *only ever mount it with
checkpointing disabled*, we get our amnesiac root file system.

## Implementation

### NixOS installation

I used the [disko](https://github.com/nix-community/disko) tool to format the
drive. When creating the root filesystem, I made sure to set the `extra_attr`
and `compression` options -- not to save space, since, well, I'm not worried
about that if everything gets wiped on every boot, but to reduce the number of
writes to the VM drive.

```nix
rootstore = {
  size = "5G";
  content = {
    type = "filesystem";
    format = "f2fs";
    extraArgs = [
      "-O"
      "extra_attr,compression"
    ];
    mountOptions = [
      "checkpoint=disable"
      "compress_algorithm=zstd"
      "lazytime"
    ];
    mountpoint = "/";
  };
};
```

Once `disko` had done its magic, I proceeded with the installation of NixOS,
following the [installation instructions](https://nixos.org/manual/nixos/stable/#sec-installation-manual).
The only additional change I had to make was to add the all-important mount
parameters to `configuration.nix`, since for some reason the `mountOptions` were
not carried over into the `hardware-configuration.nix` file you are told not to
edit.

```nix
fileSystems."/" =
  { device = "/dev/mapper/pool-rootstore";
    fsType = "f2fs";
    options = [ "checkpoint=disable" "compress_algorithm=zstd" "lazytime" ];
  };
```

And that's...pretty much it. It works as expected. Root is empty on every boot.

### When things break

After the VM was set up, to understand the possible failure modes, I set about
trying to break it by doing things you shouldn't do: filling up the root
filesystem, writing a bunch of random data around, stopping the system without
cleanly shutting it down, and so on.

One of the criticisms of f2fs I've seen is that it's fairly easy to corrupt if
you, say, cut power to the system, and `fsck.f2fs` isn't very good at
correcting errors. Now, clearly, we don't care about data corruption if there's
no data to corrupt on the filesystem. However, in trying to break this VM in a
different way, I semi-accidentally corrupted something or other on the root FS,
probably by stopping the VM without shutting down the OS. The system refused to
mount it since `fsck` wasn't returning clean. The "fix" here is just to log in
to a rescue shell by adding the boot parameters
`rescue systemd.setenv=SYSTEMD_SULOGIN_FORCE=1` and reformatting the root
partition.

The other scenario where a reformat is needed is if you somehow mount the root
filesystem without that `checkpoint=disable` option, as I did when I
accidentally removed that `filesystems."/"` override from my
`configuration.nix`. Without that option, `f2fs` will happily do the
right-in-every-circumstance-except-this-one thing and save a checkpoint for all
the data you write to the disk during that boot. There's no way to go back to an
earlier checkpoint with `f2fs`, so after that, the only way to restore the disk
to a blank state is to format it again.

I *think* it should be possible to add some systemd magic to the boot process to
cover both of those cases, like [this person](https://notthebe.ee/blog/nixos-ephemeral-zfs-root/)
does with his ZFS-based ephemeral setup: essentially, a unit that reformats the
root filesystem when `systemd-fsck-root.service` fails, or when a canary file is
detected on `/`. I haven't had a chance to play around with that idea yet.
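For what it's worth, an untested sketch of the canary-file half of that idea might look like the following. Everything here is hypothetical -- the unit name, the canary path, and the assumption that `mount` and `mkfs.f2fs` are available in the initrd:

```nix
# UNTESTED sketch. The canary is (re)created on every successful boot; if
# it's still present early in the *next* boot, the wipe didn't happen (e.g.
# checkpointing got accidentally enabled), so reformat before the real root
# is mounted.
boot.initrd.systemd.services.check-root-canary = {
  wantedBy = [ "initrd.target" ];
  before = [ "sysroot.mount" ];
  unitConfig.DefaultDependencies = false;
  serviceConfig.Type = "oneshot";
  script = ''
    mkdir -p /tmp/rootcheck
    if mount -t f2fs -o ro /dev/mapper/pool-rootstore /tmp/rootcheck; then
      if [ -e /tmp/rootcheck/.wipe-canary ]; then
        umount /tmp/rootcheck
        mkfs.f2fs -f -O extra_attr,compression /dev/mapper/pool-rootstore
      else
        umount /tmp/rootcheck
      fi
    fi
  '';
};

# Drop the canary on every boot of the real system.
systemd.tmpfiles.rules = [ "f /.wipe-canary 0644 root root -" ];
```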

## Conclusion

This is entirely just for fun on my personal systems with my personal data. I
think it'd be real silly for someone to do this in any kind of production
environment.
