I Inherited a System With No Map. So I Drew Two.

Fri, 28 Feb 2025 00:00:00 +0000

When I took over DevOps, the handover was a person, not a document. That person was leaving. Everything I’d need to keep thirty-odd services and a fleet of customer servers alive lived in his head, in scattered runbooks, and in the muscle memory of having done it before. The classic shape: the system worked, and exactly one human knew why.

So the first real project wasn’t a migration or a dashboard. It was writing down the system before the only other copy walked out the door.

The obvious move is to write the docs — one big knowledge base, ordered however the system happens to be wired. I tried that for about a day. It doesn’t work, and the reason it doesn’t work is the whole point of this post.

The two questions a new hire is actually asking

Watch someone learn an unfamiliar platform and you’ll notice they’re never confused about just one thing. They’re confused about two, and they’re different kinds of confused.

The first is “what is this technology?” — what’s a Pod, what does ArgoCD actually do, why would anyone want a secret manager with leases. This confusion is generic. It has nothing to do with us. The answer is the same whether you’re here or anywhere else.

The second is “how do we use it?” — where our ArgoCD lives, how our customer tokens are minted, which Grafana panel goes red first when a backup stalls. This confusion is entirely local. No textbook will ever answer it, because the answer is our repo and our decisions.

A single linear document forces these two into one sequence, and they fight. Explain Kubernetes from scratch and the engineer who already knows it skims and misses the system-specific bit buried in paragraph six. Skip the basics and the engineer who doesn’t know it is lost before they reach anything useful. You can’t order one list to serve both readers. So I stopped trying.

Track 1 is the textbook. Track 2 is the house.

The fix was to split the knowledge base along that exact seam.

Track 1 — Technical Foundation. Ten pages of generic DevOps: Linux, containers, Kubernetes concepts, Helm, GitOps & ArgoCD, GitLab CI/CD, Vault, Argo Events, observability, Terraform. Every page is something you could, in principle, read on any platform team on earth. Assumed background is stated up front — comfortable with Linux and shell, no Kubernetes required — so nobody has to guess whether a page is for them.

Track 2 — Our System. A dozen-plus pages of nothing but us: the cluster and its app-of-apps, the deploy pipelines, the customer model, the monitoring and backup agent, our Vault layout and token expiry monitoring, SSO, the approval portal, the full new-customer install. Every page assumes you already understand the underlying tech — and if you don’t, it links straight back to its Track 1 counterpart.

That’s the rule that keeps the split honest: each Track 1 page ends with an “in our system” link down to its implementation, and each Track 2 page names its Track 1 prerequisite at the top. Concept and implementation are separate documents, permanently wired to each other.
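That linking rule is mechanical enough to check automatically. Below is a minimal sketch in Python. It assumes each page can be reduced to a slug, its track number, and a single cross-track link. The `Page` model, the `broken_links` helper and the page slugs are all hypothetical, not the real wiki’s schema. It flags pages with no cross-track link, links to pages that don’t exist, and links that point back into the same track.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Page:
    """One knowledge-base page (hypothetical model, not a real wiki schema)."""
    slug: str
    track: int            # 1 = Technical Foundation, 2 = Our System
    link: Optional[str]   # Track 1: "in our system" target; Track 2: prerequisite


def broken_links(pages: List[Page]) -> List[str]:
    """Describe every page whose cross-track link is missing, dangling,
    or points at a page in its own track instead of the other one."""
    by_slug = {p.slug: p for p in pages}
    problems = []
    for page in pages:
        other = 2 if page.track == 1 else 1
        label = "in-our-system link" if page.track == 1 else "prerequisite"
        if page.link is None:
            problems.append(f"{page.slug}: missing {label}")
        elif page.link not in by_slug:
            problems.append(f"{page.slug}: {label} '{page.link}' does not exist")
        elif by_slug[page.link].track != other:
            problems.append(
                f"{page.slug}: {label} '{page.link}' is not a Track {other} page"
            )
    return problems


if __name__ == "__main__":
    pages = [
        Page("k8s-concepts", 1, "k8s-in-our-cluster"),
        Page("argocd-concepts", 1, None),
        Page("k8s-in-our-cluster", 2, "k8s-concepts"),
        Page("customer-tokens", 2, "vault-concepts"),
    ]
    for problem in broken_links(pages):
        print(problem)
    # argocd-concepts: missing in-our-system link
    # customer-tokens: prerequisite 'vault-concepts' does not exist
```

If the pages live in a repository, running a check like this in CI would catch a dangling link before a new hire does. A Track 2 page that needs several Track 1 prerequisites would need `link` to become a list.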

The win is that both tracks stand alone. A senior who’s done Kubernetes for years skips Track 1 entirely and reads Track 2 like a system design doc. A strong sysadmin with zero cloud-native experience leans hard on Track 1 first. Same knowledge base, two honest reading paths, neither one padded for the other reader.

The interleave is the whole trick

Two tracks on their own would just be two piles. The thing that makes them a roadmap is the order you walk them in — and the order is a zipper, not two straight lines.

Track 1: Technical Foundation        Track 2: Our System
───────────────────────────────      ──────────────────────────────────
K8s concepts          → then →       K8s in our cluster
ArgoCD concepts       → then →       our ArgoCD + GitOps flow
Vault concepts        → then →       Vault here, customer tokens
Observability theory  → then →       our Grafana dashboards, alert types

Learn the concept cold, then immediately see it wearing our clothes. The generic mental model gets nailed down by a concrete, real, in-production example before it has time to evaporate — which is the difference between “I read about ArgoCD once” and “I know where our ArgoCD is and what drift looks like on it.” Read-then-do, not read-then-read.

Four phases, because “learn DevOps” isn’t a task

A pile of pages still isn’t a plan, so the roadmap sits on top of both tracks and spends them over twenty weeks, in four phases, each with one blunt milestone:

Phase        Weeks   Milestone
───────────  ──────  ─────────────────────────────────────────────────────
Foundations  1–3     Can describe every component and monitor alerts
Operations   4–8     Can deploy a customer stack and restore a backup solo
Ownership    9–14    Can install a new customer from scratch
Mastery      15–20   Can train someone else

The milestones are deliberately verbs, not reading counts. Nobody is “done with Phase 2” because they finished the pages. They’re done when they’ve restored a backup without me in the room. The last milestone is the one that matters most to me personally — can train someone else — because that’s the only state in which I’m allowed to be hit by a bus.

The readiness tracker, or: vibes don’t scale

Here’s the part I’m most attached to, because it’s the part that fixes the original problem. “Are you ready to own this?” answered by gut feel is exactly the tribal-knowledge trap I was trying to escape, just relocated into the new hire’s head.

So full ownership is broken into eight weighted domains. At the end of every phase you score yourself against them, honestly, and then study your lowest numbers, not your favorites. It turns “do I know enough yet?” from a vibe into a number with a gap next to it: the same instinct I’d apply to a service I’m monitoring, pointed at a person’s readiness instead. You don’t get to feel ready. You get to be measurably less unready at the end of every phase.
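The arithmetic behind that is small. Here is a sketch in Python, assuming a 0 to 5 self-assessment scale and integer weights that sum to 100. The eight domain names, their weights, and the `readiness` and `study_next` helpers are hypothetical stand-ins, since the real domains aren’t listed here. It reports one overall percentage and ranks domains by weighted gap (weight times points missing). That ranking is one reasonable reading of “study your lowest numbers” once domains carry different weights.

```python
from typing import Dict, List, Tuple

MAX_SCORE = 5  # assumed 0-5 self-assessment scale per domain

# Hypothetical domains and weights (they sum to 100); not the real eight.
WEIGHTS: Dict[str, int] = {
    "cluster & Kubernetes": 15,
    "deploy pipelines & GitOps": 15,
    "customer model & installs": 15,
    "monitoring & alerts": 15,
    "backups & restore": 15,
    "Vault & customer tokens": 10,
    "SSO & approval portal": 10,
    "Terraform & infrastructure": 5,
}


def readiness(scores: Dict[str, int]) -> float:
    """Overall readiness: weighted score as a percentage of the maximum."""
    earned = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return 100 * earned / (MAX_SCORE * sum(WEIGHTS.values()))


def study_next(scores: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """The n domains with the largest weighted gap (weight * points missing)."""
    gaps = [(d, WEIGHTS[d] * (MAX_SCORE - scores[d])) for d in WEIGHTS]
    gaps.sort(key=lambda item: item[1], reverse=True)  # stable: ties keep dict order
    return gaps[:n]


if __name__ == "__main__":
    end_of_phase = {
        "cluster & Kubernetes": 4,
        "deploy pipelines & GitOps": 3,
        "customer model & installs": 1,
        "monitoring & alerts": 4,
        "backups & restore": 2,
        "Vault & customer tokens": 2,
        "SSO & approval portal": 3,
        "Terraform & infrastructure": 1,
    }
    print(f"readiness: {readiness(end_of_phase):.1f}%")  # readiness: 53.0%
    for domain, gap in study_next(end_of_phase):
        print(f"{domain}: weighted gap {gap}")
    # customer model & installs: weighted gap 60
    # backups & restore: weighted gap 45
    # deploy pipelines & GitOps: weighted gap 30
```

In this example Terraform ties for the lowest raw score (1/5) but doesn’t make the top three, because its small weight keeps its weighted gap low. If you’d rather study strictly by raw score, sort on `MAX_SCORE - scores[d]` instead.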

What I’d tell the next me

The mistake I almost made was treating onboarding docs as a description of the system. They’re not. A description is ordered by how the machine is built. Onboarding has to be ordered by how a human learns — and a human learning a platform is running two processes at once, the general and the specific, and you have to feed both without starving either.

Splitting the knowledge base in two felt like more work and more surface to maintain. It was the opposite. Now when the tech changes, I edit Track 1. When we change, I edit Track 2. The seam that makes it easy to read is the same seam that makes it easy to keep alive.

The handover I got was a person. The handover I’m leaving is a map — and it’s drawn so the next person can read it without me standing behind them. That was the entire goal. The fact that I can now point a brand-new hire at a URL instead of at my calendar is just the proof it worked.