nixify — architecture & engineering practices

This is the public documentation surface for nixify, a private NixOS fleet managed declaratively with divnix/std + divnix/hive. The infrastructure source is private; what follows is a curated, secret-free description of how it is built, deployed, validated, and operated.

It is written for a technical hiring audience: the goal is to show how a non-trivial system is reasoned about and kept reproducible, not to expose the system itself. Every host is referred to by role, never by address, and a leak gate enforces that on every change.

Start here

At a glance

  • One typed monorepo. Every host, home profile, module, and CI shell is a cell in a single std/hive tree — discovered structurally, not by hand-wired imports.
  • Reproducible by construction. Flake-pinned inputs; the same closure builds locally, in CI, and on the deploy target.
  • Validated before it switches. A deploy is not “done” at build — it passes remote build, closure diff/RCA, dry-activate, and a switch test before it’s marked as a known-good point.
  • Agent-operated, operator-gated. Linear, convention-named branches; AI agents propose, the operator merges fast-forward only; protected branches never take a direct push.

Provenance: generated from nixify master @ 0d8ff78 (approved baseline); latest linear chain @ cd41d79; 2026-06-21.

Architecture overview

nixify is a single NixOS monorepo that defines an entire multi-host fleet — laptops, WSL dev boxes, a NAS, dedicated build machines, an edge ML node, and cloud burst capacity — as one typed, reproducible tree. It is built on divnix/std (a cell framework for organising Nix code) and divnix/hive (std’s NixOS fleet layer), and deployed with colmena.

This page describes the design at a level useful to an engineer evaluating the work. It contains no addresses, keys, or secrets — see the leak gate for why.

Why a monorepo of cells

Most personal infra grows into a pile of per-host config files that drift apart. nixify instead treats every concern as a cell — a self-describing unit that std discovers structurally. There are no hand-maintained import lists; you add a cell, and the framework wires it into the right outputs.

cells/
├── hosts/      nixosConfigurations  — one entry per machine in the fleet
├── cluster/    colmenaConfig        — fleet-wide deploy topology (by role)
├── nixos/      nixosProfiles        — composable system-level building blocks
├── home/       profiles             — home-manager user environments
├── modules/    reusable NixOS modules shared across hosts
├── users/      user identity / account definitions
├── agents/     AI-agent tooling: profiles, packages, scripts, skills
└── repo/       devshells            — the developer & CI entry shell

The payoff: a new host is a small composition of existing profiles, not a copy-paste of another host’s config. Shared behaviour lives in one module and every consumer gets fixes at once.
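To make "discovered structurally" concrete, here is a minimal sketch in Python rather than the Nix that std actually evaluates. It assumes a hypothetical on-disk layout of cells/<cell>/<block>.nix and simply walks the tree; std's real discovery rules are richer than this, and the names discover_cells and demo are illustrative only.

```python
from pathlib import Path
import tempfile


def discover_cells(root: Path) -> dict[str, list[str]]:
    """Map each cell directory to the sorted names of the .nix blocks inside it.

    Assumed layout: <root>/<cell>/<block>.nix. Nothing is listed by hand;
    adding a file is enough for it to appear in the result.
    """
    cells: dict[str, list[str]] = {}
    for cell_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        blocks = sorted(f.stem for f in cell_dir.glob("*.nix"))
        if blocks:
            cells[cell_dir.name] = blocks
    return cells


def demo() -> dict[str, list[str]]:
    # Build a throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "cells"
        for rel in ("hosts/nixosConfigurations.nix",
                    "nixos/nixosProfiles.nix",
                    "repo/devshells.nix"):
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("{ }\n")
        return discover_cells(root)


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the absence of an import list: the directory structure is the registry, so a new cell cannot be forgotten in some central file.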

The fleet, by role

The fleet is ~15 hosts. They are documented here by role, never by hostname or address:

| Role | What it does |
| --- | --- |
| WSL dev hosts | Day-to-day development on Windows machines via NixOS-WSL. Build nothing heavy locally; they delegate to remote builders. |
| Remote builder | The workhorse that performs real Nix builds on behalf of constrained hosts. |
| Cloud burst builder | On-demand additional build/CI capacity. |
| NAS | Storage and service host. |
| Edge ML node | A Jetson-class device for on-device inference workloads. |
| Operator workstation(s) | Where deploys are authorised and merged. |

A recurring design constraint shapes the whole system: resource-constrained hosts must never run heavy Nix evaluation or builds locally. They are configured with zero local build jobs and offload to a remote builder over a trusted channel. This keeps a laptop or WSL box responsive while still getting fully reproducible system closures.

The deploy-and-validate pipeline

A build succeeding is not the bar for “deployed.” nixify treats a build as the first of several gates, because a closure that builds can still fail to activate. The pipeline:

  1. Remote evaluate + build. The target closure is built on a dedicated builder, not on the (often constrained) host being deployed.
  2. Closure diff / RCA. The new closure is compared against the currently running one (package-version diff + structural diff) so every change is accounted for before it lands — no silent drift.
  3. Dry-activate. The activation script is run in dry mode to surface problems that only appear at switch time.
  4. Switch test. The configuration is actually switched in a test context to prove it activates, not merely builds.
  5. Mark known-good. On success the working closure is recorded as a garbage-collection-rooted, tagged “last-known-good” point that can be redeployed deterministically.
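The ordering above can be sketched as a short-circuiting sequence of gates. This is an illustration in Python under stated assumptions, not nixify's implementation: the gate names are taken from the list, but each gate is a stand-in callable, and DeployResult and run_pipeline are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A gate receives a closure identifier and returns True on success.
Gate = Callable[[str], bool]

GATE_NAMES = ["remote-build", "closure-diff", "dry-activate", "switch-test"]


@dataclass
class DeployResult:
    closure: str
    passed: list[str] = field(default_factory=list)
    failed_at: Optional[str] = None

    @property
    def known_good(self) -> bool:
        # Only a run that cleared every gate earns the label.
        return self.failed_at is None


def run_pipeline(closure: str, gates: list[tuple[str, Gate]]) -> DeployResult:
    """Run the gates in order and stop at the first failure."""
    result = DeployResult(closure)
    for name, gate in gates:
        if not gate(closure):
            result.failed_at = name
            break
        result.passed.append(name)
    return result


if __name__ == "__main__":
    # Stand-in gates: everything passes except dry-activate.
    gates = [(n, (lambda c, n=n: n != "dry-activate")) for n in GATE_NAMES]
    outcome = run_pipeline("example-closure", gates)
    print(outcome.passed, outcome.failed_at, outcome.known_good)
```

A closure that builds but fails a later gate ends with known_good False, which is the distinction the pipeline exists to draw.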

Rollback means redeploying the previous known-good closure, never “revert unvalidated history and hope.” This mirrors a release-pointer discipline: a release is an annotated tag on an exact validated commit, and a per-target pointer fast-forwards to it.

Secrets model

Secrets are managed declaratively and encrypted at rest, layered over sops-nix and an age/agenix-based flow. Host SSH identities are derived into age recipients, so a host can decrypt exactly the secrets provisioned for it and nothing else. No plaintext secret ever enters the repo or a build log. (Concrete recipients, key material, and the secret inventory are intentionally absent from this public surface.)

Engineering workflow: agent-operated, operator-gated

nixify is developed by a mix of a human operator and several AI coding agents working in parallel. The discipline that keeps that safe:

  • Convention-named, linear branches. Branches follow <owner>/<type>/<topic> (owner ∈ claude, gemini, copilot, codex; type ∈ feat, fix, refactor, docs, chore, ci, test). History stays linear — fixes amend the commit that introduced the problem rather than stacking forward-patches.
  • Protected baselines. The mainline and operator branches never take a direct agent push. Agents propose; the operator integrates fast-forward only. An explicit, quoted operator grant is the only override, and it is recorded in the merge itself.
  • Parallel chains, isolated worktrees. Each agent works its own branches in its own git worktree, forked from the operator branch the work was authored on. Agents never touch another agent’s branches; coordination happens through in-repo handoff notes.
  • Mechanical gates, not vibes. Native git hooks enforce the branch policy and a single inline-Nix gate runs the validation hooks. The rule is simple: no green, no merge.

Reproducibility & caching

All inputs are flake-pinned, so the same revision evaluates to the same closure everywhere. A binary cache shared across the fleet means a closure built once — in CI or on the builder — is fetched, not rebuilt, by every other host. CI builds the fleet and populates that cache as a side effect of validation.

Safety: the leak gate

Because this repository is public and describes private infrastructure, a scripts/leak-scan.sh gate runs on every push and as part of nix flake check. It fails the publish if it finds host addresses (RFC1918 or Tailscale-range), age recipients, SSH public-key blobs, or private-key blocks anywhere in the docs. Topology is expressed by role precisely so this gate can stay strict.
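As an illustration of what such a gate checks, here is a minimal sketch in Python. The real gate is scripts/leak-scan.sh and its rules are not reproduced here; the patterns below are approximations chosen for this example, and scan, PATTERNS, and PRIVATE_NETS are hypothetical names.

```python
import ipaddress
import re

# Address ranges treated as leaks: the three RFC 1918 blocks plus the
# 100.64.0.0/10 block that Tailscale assigns node addresses from.
PRIVATE_NETS = [ipaddress.ip_network(n) for n in
                ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")]

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Approximate shapes only; a production gate would likely be stricter.
PATTERNS = {
    "age-recipient": re.compile(r"\bage1[0-9a-z]{20,}\b"),
    "ssh-public-key": re.compile(
        r"\b(?:ssh-(?:ed25519|rsa)|ecdsa-sha2-nistp\d+) AAAA[0-9A-Za-z+/=]+"),
    "private-key-block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, finding kind) for every suspected leak."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_RE.findall(line):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looks like a dotted quad but is not a valid address
            if any(addr in net for net in PRIVATE_NETS):
                findings.append((lineno, "private-address"))
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, kind))
    return findings
```

A hook built on this would exit non-zero whenever scan returns anything. Describing hosts by role means the docs never need a legitimate exception, so the rule set can stay this blunt.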


This is a curated overview; a deeper per-subsystem wiki exists and is shared on request.

Deep dive (hiring-manager handover)

These pages go a layer below the architecture overview into the subsystems an interviewer is likely to probe. They are still scrubbed — no addresses, keys, or secret inventory — but they assume a reader who wants the mechanics, not the summary.

This deep/ set lives on an unmerged branch and is shared on request. It stays out of the default branch (and therefore out of the auto-generated public wiki) until publishing it is intended.

Contents

  • Secrets model — how sops-nix + age give each host exactly the secrets it needs and nothing else.
  • Deploy & validate pipeline — the gates between “it builds” and “it’s known-good,” and why each exists.
  • CI gates — the single inline-Nix git-hook gate and what it enforces at commit and push time.
  • Multi-agent branching — how several AI agents and one operator share a linear history without stepping on each other.


Secrets model

The goal: every secret is encrypted at rest in the repo and decryptable only by the hosts (and the operator) that legitimately need it, with no plaintext ever entering a build log, the world-readable Nix store, or git history.

Layers

  1. sops-nix provides the NixOS integration: a host declares which encrypted files it consumes, and the activation step decrypts them into a private runtime path (not the world-readable store).
  2. age is the encryption backend. Recipients are age public keys; the matching private key lives only on the host or in the operator’s key store.
  3. Host identity → age recipient. A host’s existing SSH host key is derived into an age recipient. That means provisioning a new host doesn’t require minting and distributing a separate encryption identity — the identity the host already proves over SSH is the decryption identity.

Why derive age keys from SSH host keys

It collapses two problems into one. The host already has an SSH identity that the rest of the fleet trusts for remote-build and deploy channels. Reusing it as the decryption identity means:

  • No second secret to rotate when a host is rebuilt.
  • The set of “who can decrypt this” is exactly the set of hosts you already enumerate for deploy.
  • Re-keying a secret to add/remove a host is a declarative edit plus a re-key pass, not a manual key-handoff.

Blast radius

Each secret is encrypted to the minimal recipient set: the operator plus the specific hosts that mount it. A compromised host can decrypt only what was provisioned to it; it cannot read another host’s secrets, because it was never a recipient. The operator key is the only universal recipient and lives off the fleet.
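The recipient rule can be modelled in a few lines. This is a sketch under stated assumptions: the secret names and the mount table are invented placeholders (the real inventory is intentionally absent), hosts are named by role, and recipients and readable_by are hypothetical helpers.

```python
OPERATOR = "operator"  # placeholder label for the off-fleet operator key

# Invented example data: secret name -> roles of the hosts that mount it.
MOUNTS = {
    "secret-a": {"nas"},
    "secret-b": {"remote-builder"},
    "secret-c": {"wsl-dev", "remote-builder"},
}


def recipients(secret: str) -> set[str]:
    """Minimal recipient set: the operator plus the hosts that mount the secret."""
    return {OPERATOR} | MOUNTS[secret]


def readable_by(identity: str) -> set[str]:
    """Everything a given identity can decrypt, i.e. its blast radius."""
    return {s for s in MOUNTS if identity in recipients(s)}


if __name__ == "__main__":
    for who in ("nas", "remote-builder", "edge-ml", OPERATOR):
        print(who, sorted(readable_by(who)))
```

In this model a compromised NAS exposes only secret-a, and a host that mounts nothing can decrypt nothing, because it was never added as a recipient in the first place.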

What is intentionally not here

The recipient list, the secret inventory, key fingerprints, and the cache signing keys are all absent from this public surface by policy — the leak gate (scripts/leak-scan.sh) fails the publish if any of them appear.

Deploy & validate pipeline

The governing belief: a closure that builds is not a closure that works. Activation can fail for reasons a build never surfaces — a service that won’t restart, a migration that errors, an ordering problem between units. So a deploy is a sequence of gates, and only the last one earns the “known-good” label.

The gates

1. Remote evaluate + build

The target closure is evaluated and built on a dedicated builder, never on a constrained host. Constrained hosts (WSL boxes, laptops) are configured with zero local build jobs and offload everything. This keeps the dev machine responsive and makes the build environment consistent regardless of which host the closure is for.

2. Closure diff / RCA

Before anything switches, the candidate closure is diffed against the one currently running on the target:

  • a package-version diff (what moved, what was added/removed), and
  • a structural / derivation diff for changes that don’t show up as a version bump.

This is the root-cause-analysis step: every change in the new system is explained before it lands. It catches accidental input drift — an unrelated flake input bumping a dependency you didn’t mean to touch.
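The package-version half of that diff can be sketched as follows. This is an illustration in Python, not the tooling nixify uses: the name/version split is an approximation of Nix's convention, a closure is assumed to be a plain list of store paths, and packages that appear at several versions in one closure collapse to a single entry. The structural diff is not covered.

```python
import re

# Approximation: the version starts at the first dash followed by a digit.
_SPLIT = re.compile(r"-(?=\d)")


def parse(store_path: str) -> tuple[str, str]:
    """Split /nix/store/<hash>-<name>-<version> into (name, version)."""
    base = store_path.rsplit("/", 1)[-1]
    _hash, _, name = base.partition("-")  # drop the hash prefix
    parts = _SPLIT.split(name, maxsplit=1)
    return (parts[0], parts[1] if len(parts) == 2 else "")


def closure_diff(old: list[str], new: list[str]) -> dict[str, dict]:
    """Report packages added, removed, or changed between two closures."""
    before, after = dict(map(parse, old)), dict(map(parse, new))
    return {
        "added": {n: after[n] for n in after.keys() - before.keys()},
        "removed": {n: before[n] for n in before.keys() - after.keys()},
        "changed": {n: (before[n], after[n])
                    for n in before.keys() & after.keys()
                    if before[n] != after[n]},
    }
```

An unexpected entry under "changed" for a package nobody meant to touch is exactly the accidental input drift this gate is meant to catch.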

3. Dry-activate

The activation script runs in dry mode. This surfaces switch-time problems (restart ordering, would-be-failed units) without committing to the change.

4. Switch test

The configuration is actually switched in a test context — proving activation, not just evaluation. Where the host model supports it, this exercises the real switch-to-configuration path so the “it activates” claim is earned, not assumed.

5. Mark known-good

On success, the working closure is pinned as a garbage-collection-rooted, tagged last-known-good point. Because it’s GC-rooted, it survives a nix store gc; because it’s tagged, it’s addressable for redeploy.

Rollback

Rollback means redeploying the previous known-good closure, not “reverting unvalidated history.” The known-good tag is an exact, reproducible target, so recovery is deterministic and doesn’t depend on rebuilding from a possibly-dirty intermediate state.
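Choosing the rollback target is then a lookup, not a rebuild. The sketch below is a Python illustration with hypothetical names (KnownGood, rollback_target) and an assumed record shape; how nixify actually stores its known-good points is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownGood:
    tag: str         # hypothetical tag name
    store_path: str  # GC-rooted closure the tag points at


def rollback_target(history: list[KnownGood], current: str) -> Optional[KnownGood]:
    """Return the newest known-good record that is not the closure running now.

    `history` is ordered oldest to newest and holds only validated closures,
    so an unvalidated intermediate state can never be selected.
    """
    for record in reversed(history):
        if record.store_path != current:
            return record
    return None
```

If the running closure is itself the latest known-good point, the function returns the one before it; if the running closure never passed validation, it returns the latest known-good point.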

Release pointers

The same idea scales to releases: a release is an annotated tag (release/<target>/<UTC>-<shortrev>) on the exact validated commit, recording what was built and the verification evidence. A per-target release/<target> branch fast-forwards to that tag — a moving pointer to the latest proven-working state. Non-fast-forward updates are refused.
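Both halves of that rule are small enough to sketch. The following Python is illustrative only: the timestamp layout and the 7-character short revision are assumptions (the source fixes only the overall tag shape), and the commit graph is modelled as a plain dictionary of parents rather than read from git.

```python
from datetime import datetime, timezone


def release_tag(target: str, rev: str, when: datetime) -> str:
    """Build release/<target>/<UTC>-<shortrev>.

    Assumed details: a YYYYMMDDTHHMMSSZ timestamp and a 7-character short rev.
    `when` should be timezone-aware; it is converted to UTC before formatting.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"release/{target}/{stamp}-{rev[:7]}"


def is_fast_forward(parents: dict[str, list[str]], old_tip: str, new_tip: str) -> bool:
    """True when old_tip is an ancestor of, or equal to, new_tip."""
    stack, seen = [new_tip], set()
    while stack:
        commit = stack.pop()
        if commit == old_tip:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False
```

Against a real repository the same ancestry question is what git merge-base --is-ancestor answers; refusing the update when the answer is no is what keeps the release pointer from ever moving sideways or backwards.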

CI gates

The repo deliberately runs one validation gate, implemented as inline-Nix git hooks rather than an external framework. The history here is instructive: earlier iterations layered a third-party pre-commit framework and a separate hook runner on top of each other, which drifted out of sync and produced confusing failures. That was collapsed into a single inline-Nix gate so there is exactly one source of truth for “what must pass.”

At commit time

  • Branch-policy guard. A pure-bash guard refuses commits that violate the branch convention or target a protected branch directly (see multi-agent branching). No dependencies, so it runs identically for humans and agents on every host.
  • Conventional-commit shape. Commits are type(scope): summary; fixes amend the introducing commit so history stays linear.
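The two commit-time checks amount to pattern matching. The real guard is pure bash; the sketch below restates the idea in Python for readability. The owner and type lists come from the branch convention, while the allowed topic characters, the reuse of the branch types as commit types, the mandatory scope, and the single-entry protected set are assumptions made for this example.

```python
import re

OWNERS = ("claude", "gemini", "copilot", "codex")
TYPES = ("feat", "fix", "refactor", "docs", "chore", "ci", "test")

# <owner>/<type>/<topic>; the topic character set is an assumption.
BRANCH_RE = re.compile(
    rf"^(?:{'|'.join(OWNERS)})/(?:{'|'.join(TYPES)})/[a-z0-9][a-z0-9._-]*$")

# type(scope): summary; commit types assumed to match the branch types.
COMMIT_RE = re.compile(rf"^(?:{'|'.join(TYPES)})\([a-z0-9._/-]+\): \S.*$")

PROTECTED = {"master"}  # placeholder; the real protected set is broader


def check_commit(branch: str, subject: str) -> list[str]:
    """Return the reasons a commit should be refused (empty means allowed)."""
    problems = []
    if branch in PROTECTED:
        problems.append(f"direct commit to protected branch {branch!r}")
    elif not BRANCH_RE.match(branch):
        problems.append(f"branch {branch!r} does not match <owner>/<type>/<topic>")
    if not COMMIT_RE.match(subject):
        problems.append("subject is not of the form type(scope): summary")
    return problems
```

A hook would print the returned reasons and exit non-zero if the list is non-empty. Keeping the check dependency-free, as the bash original does, is what lets it run identically for humans and agents.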

At push time

  • Heavy validation is opt-out-by-design on constrained hosts. The push-time Nix evaluation/build routes through a remote builder; a constrained host does not run it locally. Where a host genuinely cannot reach the builder (e.g. a tunnel is down), the gate is skipped explicitly and visibly, not silently.
  • No --no-verify. Bypassing the gate is not part of the workflow. The documented escape hatches are narrow and named (for example, the upstream tool’s own “no config present” flag), and they preserve the real checks rather than skipping them.

In GitHub Actions

CI builds the fleet and, as a side effect of validation, populates the shared binary cache, so a closure proven in CI is fetched rather than rebuilt by the hosts. Actions usage is rationed deliberately: commits that don’t need CI carry [skip ci], chronically failing workflows are disabled until fixed rather than left to burn minutes, and a red run is reproduced locally before it is pushed again.

Why hooks over a hosted-only gate

Putting the gate in git hooks means the policy is enforced at the point of action, on every machine, before anything reaches the network — not only after a push when CI minutes are already spent. CI is the backstop, not the front line.

Multi-agent branching

nixify is worked on by several AI coding agents and one human operator, sometimes concurrently. The branching model exists to let that happen without agents corrupting each other’s work or the mainline — while keeping a strictly linear history.

Namespace

Branches are <owner>/<type>/<topic>:

  • owner ∈ claude, gemini, copilot, codex: which agent (or the human) owns the branch.
  • type ∈ feat, fix, refactor, docs, chore, ci, test.
  • topic — the unit of work.

Protected refs (mainline, operator branches, and personal namespaces) are operator-only. Agents never commit, merge, or push there.

Parallel chains, one topic per root

Each agent works its own branches in its own git worktree, forked from the HEAD of the operator branch the work was authored on. One linear chain per topic; topics are never mixed across roots. Worktrees give each agent an isolated checkout sharing one object store, so parallel work doesn’t collide on a single working tree.

Agents do not touch another agent’s branches or worktrees. Cross-agent coordination happens through in-repo notes (handoff / session files at teardown), not by reaching into someone else’s checkout.

Integration is fast-forward only

Agents propose; the operator integrates. Merges into operator branches are fast-forward only, so the graph never grows a merge bubble and master stays a straight line. The only override is an explicit operator grant, and the rule is that the grant is quoted — its date, exact wording, and scope are recorded in the commit or merge body. (CLI merges that bypass local hooks make that quoting rule matter more, not less.)

Linear history, amend-at-source

A wrong commit is fixed where it was introduced — by amending that commit — rather than by stacking a forward-fix on top. The result is a history that reads as a sequence of correct, self-contained changes, which is far easier to bisect, review, and cherry-pick across the parallel chains.

Why this shape

Three properties fall out of it:

  1. Safety — protected branches can only move through the operator, so an agent mistake is contained to its own branch.
  2. Reviewability — linear, convention-named history makes “what changed and who changed it” obvious.
  3. Parallelism — isolated worktrees plus per-topic chains let multiple agents make progress at once without a shared-state bottleneck.