<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Devops on hippotion</title><link>https://blog.hippotion.com/tags/devops/</link><description>Recent content in Devops on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 21 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/devops/index.xml" rel="self" type="application/rss+xml"/><item><title>📝 Dev Notes</title><link>https://blog.hippotion.com/posts/dev-notes/</link><pubDate>Sun, 21 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/dev-notes/</guid><description>Running notes on things I&amp;rsquo;ve hit, fixed, or found worth remembering.</description><content:encoded><![CDATA[<h2 id="kubernetes-init-container-crash-loop-leaves-dirty-emptydir">Kubernetes: init container crash loop leaves dirty emptyDir</h2>
<p>When a pod&rsquo;s init container crashes, Kubernetes restarts <strong>only the init container</strong> — not the whole pod. The <code>emptyDir</code> volume survives between retries. If your init container does a <code>git clone</code> into a fixed path, the second attempt fails with &ldquo;destination path already exists.&rdquo;</p>
<p>Fix: <code>rm -rf</code> the target dir before cloning.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">rm -rf /git/repo
</span></span><span class="line"><span class="cl">git clone --depth<span class="o">=</span><span class="m">10</span> --branch<span class="o">=</span>main https://... /git/repo
</span></span></code></pre></div><p>Even after many restarts, no manual cleanup is needed: events expire after about an hour, and the Deployment controller replaces old pods automatically. Check recovery with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl get events -n &lt;namespace&gt; --sort-by<span class="o">=</span><span class="s1">&#39;.lastTimestamp&#39;</span> <span class="p">|</span> tail -10
</span></span></code></pre></div><h2 id="a-cpu-spike-that-was-actually-memory-thrashing-adding-ga4-to-hugo">A &ldquo;CPU spike&rdquo; that was actually memory thrashing (adding GA4 to Hugo)</h2>
<p>Wanted Google Analytics on this blog. PaperMod already calls a <code>google_analytics.html</code> partial in <code>head.html</code>, but it&rsquo;s gated behind <code>hugo.IsProduction | or (eq site.Params.env &quot;production&quot;)</code>. My blog pod runs <code>hugo server</code>, which reports the environment as <em>development</em> <strong>by default</strong>, so the partial never fires. I &ldquo;fixed&rdquo; that by setting <code>env = &quot;production&quot;</code>.</p>
<p>That was the wrong lever. <code>env = production</code> flips on Hugo&rsquo;s whole production path — minification, OpenGraph, Twitter cards, schema JSON across every page. The next full rebuild blew past the pod&rsquo;s <strong>128Mi</strong> memory limit and got <strong>OOMKilled</strong> (exit 137). Server load jumped.</p>
<p>The right way to add GA without touching the build mode: drop the tag in <code>layouts/_partials/extend_head.html</code>. PaperMod includes that partial <em>unconditionally</em>, above the production guard — so it loads under <code>hugo server</code> too.</p>
<p>But here&rsquo;s the part that fooled me. After reverting <code>env</code>, load was <em>still</em> climbing — to ~14 on a single node — and <code>ps</code> showed hugo at &ldquo;500% CPU&rdquo;. Looked like a runaway compute loop. It wasn&rsquo;t:</p>
<pre tabindex="0"><code>%Cpu(s): 2.1 us, 41.0 sy, 6.9 id, 50.0 wa     &lt;- 50% iowait, 2% userspace
PID ... S  %CPU  COMMAND
... D  333  hugo    &lt;- state D, RES pinned at 127MiB (the 128Mi cgroup limit)
</code></pre><p>Two lessons:</p>
<ol>
<li><strong><code>ps %CPU</code> is a lifetime average</strong>, not instantaneous. A process that ran hot for 1s then blocked still shows a big number for a while. Use <code>top</code> for what&rsquo;s happening <em>now</em>.</li>
<li><strong>High load + high <code>%wa</code> + a <code>D</code>-state process sitting at its cgroup memory limit = memory thrashing, not CPU.</strong> Hugo wasn&rsquo;t computing — it was wedged against the 128Mi ceiling, and every allocation triggered kernel reclaim/swap. A sub-second build dragged out for minutes in uninterruptible I/O sleep, and all those blocked tasks are what inflate load average (Linux counts <code>D</code>-state in load).</li>
</ol>
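<p>The arithmetic behind the first lesson, as a minimal Python sketch. The numbers are invented for illustration (a multi-threaded process that burns 5 CPU-seconds in its first wall-clock second, then blocks). The ratio itself is how procps <code>ps</code> documents <code>%CPU</code>: CPU time used divided by the time the process has been running.</p>

```python
def ps_cpu_percent(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Lifetime average, as ps reports it: total CPU time / time since start."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_cpu_percent(cpu_before: float, cpu_after: float, interval: float) -> float:
    """What an interval sampler such as top shows: CPU time used in one sample window."""
    return 100.0 * (cpu_after - cpu_before) / interval


# Hypothetical process: 5 CPU-seconds in the first second, then blocked (state D).
print(ps_cpu_percent(5.0, 1.0))             # 500.0 right after the burst
print(ps_cpu_percent(5.0, 10.0))            # 50.0 after nine idle seconds
print(interval_cpu_percent(5.0, 5.0, 3.0))  # 0.0 for the latest 3 s window
```

<p>The lifetime figure decays slowly, so a blocked process can keep looking busy in <code>ps</code> long after it stopped using the CPU.</p>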
<p>The actual fix was boring: 128Mi was always marginal for <code>hugo-extended</code> + PaperMod. Bumped the limit to 512Mi and the thrash vanished.</p>
<p>Takeaway: when load spikes, read <code>%wa</code> and process state before blaming the CPU. And don&rsquo;t flip <code>env=production</code> on a long-lived <code>hugo server</code> just to ungate one partial — use <code>extend_head.html</code>.</p>
<h2 id="self-hosting-supabase-lean-on-k3s-the-gotcha-checklist">Self-hosting Supabase (lean) on k3s: the gotcha checklist</h2>
<p>Ran the community <code>supabase/supabase</code> chart on a 16Gi single node — enabled db, rest, auth, meta, studio, kong + the log pipeline (analytics/Logflare + vector); left realtime, storage, imgproxy, edge-functions off. The deploy is easy; these are the things that actually bit:</p>
<ul>
<li><strong>Studio shows &ldquo;no tables&rdquo;.</strong> Supabase is single-database by design — Studio, PostgREST and auth all use the database named <code>postgres</code>. App tables in a <em>separate</em> database are invisible to all of it. Put your schema in <code>postgres</code>&rsquo;s <code>public</code> schema.</li>
<li><strong>Studio won&rsquo;t schedule with edge-functions disabled.</strong> Its Deployment mounts the functions PVC unconditionally. Either run functions, or create the PVC yourself and leave functions off.</li>
<li><strong>edge-functions crashloops</strong> if you keep it: it boots by fetching a Deno module from the internet, which a deny-all egress policy blocks. You usually only want the PVC it leaves behind anyway.</li>
<li><strong>vector (log collector) stays silent</strong> under a deny-all policy. It discovers pods via the Kubernetes API, so it needs <strong>API egress</strong>, not just app ports (<code>allowEgressToKubeApi</code>). A log shipper that can&rsquo;t reach the API collects nothing and doesn&rsquo;t say why.</li>
<li><strong><code>secretRef</code> must contain <em>every</em> key the chart maps</strong> — including non-secret ones like <code>database</code> and <code>openAiApiKey</code>. Miss one and pods sit in <code>CreateContainerConfigError</code>.</li>
<li><strong>ESO <code>ExternalSecret</code> shows perpetual <code>OutOfSync</code> in Argo CD</strong> unless you spell out the remoteRef defaults (<code>conversionStrategy: Default</code>, <code>decodingStrategy: None</code>, <code>metadataPolicy: None</code>) — ESO writes them back, and the compact form drifts.</li>
<li><strong><code>postgres</code> is not a superuser.</strong> <code>CREATE DATABASE … OWNER app</code> fails with <code>must be member of role</code>. Supabase keeps the real superuser (<code>supabase_admin</code>) to itself; <code>GRANT app TO postgres</code> first.</li>
<li><strong>Logflare needs no BigQuery.</strong> It runs on the self-hosted Postgres backend (the <code>_supabase</code> database, <code>_analytics</code> schema) — logs land in <code>_analytics.log_events_*</code>.</li>
</ul>
<p>None of this is in the README. It&rsquo;s the gap between &ldquo;I deployed Supabase&rdquo; and &ldquo;I run it.&rdquo;</p>
]]></content:encoded></item><item><title>🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n</title><link>https://blog.hippotion.com/posts/ats-job-poller/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ats-job-poller/</guid><description>You don&amp;rsquo;t have to be job-hunting to want to know your market — what&amp;rsquo;s out there, what it pays, where you&amp;rsquo;d fit. So I built an n8n workflow: it polls the public ATS APIs (Greenhouse/Lever/Ashby) plus a broad remote-jobs feed, filters for remote-EU infra roles, scores each posting against my CV with an LLM, and emails me only the 80%+ matches. No database, no scraping.</description><content:encoded><![CDATA[<p>You don&rsquo;t have to be about to change jobs to want to know the landscape. What&rsquo;s being built, what it
pays, where you&rsquo;d actually fit — staying current on the market (and your own worth) is just good
professional hygiene. The trouble is that <em>checking</em> is tedious, so most of us don&rsquo;t, until we&rsquo;re
already job-hunting and starting cold.</p>
<p>So I automated mine. An <a href="https://n8n.io">n8n</a> workflow on my homelab polls job boards every six hours,
scores each new posting against my profile with an LLM, and emails me only the strong matches — the
ones scoring 80%+. When it&rsquo;s quiet, it&rsquo;s silent. When something genuinely fits, I know the same day.
Here&rsquo;s what I learned building it. Repo at the bottom.</p>
<h2 id="three-apis-cover-most-of-the-market">Three APIs cover most of the market</h2>
<p>Company career pages look bespoke, but underneath, a large share of them run on one of three ATSs, and
all three hand you the jobs as unauthenticated JSON:</p>
<ul>
<li><strong>Greenhouse</strong> — <code>boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true</code></li>
<li><strong>Lever</strong> — <code>api.lever.co/v0/postings/{token}?mode=json</code></li>
<li><strong>Ashby</strong> — <code>api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true</code></li>
</ul>
<p>No scraping, no headless browser. You poll the API the page itself calls, normalize the three
shapes into one <code>{ company, title, location, remote, url, posted_at, description, external_id }</code>, and
you&rsquo;re done with the hard part.</p>
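<p>A rough Python sketch of that setup, for illustration only (the workflow itself lives in n8n). The URL templates are the three listed above, with an assumed <code>https://</code> scheme. The <code>Job</code> field names come from the normalized shape, but their types are guesses, and <code>acme</code> is a made-up board token. The per-ATS field mapping is left out because the response fields aren&rsquo;t listed here.</p>

```python
from dataclasses import dataclass
from urllib.parse import quote

# Templates copied from the list above; the https:// scheme is an assumption.
ENDPOINTS = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true",
    "lever": "https://api.lever.co/v0/postings/{token}?mode=json",
    "ashby": "https://api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true",
}


@dataclass
class Job:
    """The single shape all three sources are normalized into (types assumed)."""
    company: str
    title: str
    location: str
    remote: bool
    url: str
    posted_at: str
    description: str
    external_id: str


def feed_url(ats: str, token: str) -> str:
    """Build the public JSON endpoint for one company board."""
    return ENDPOINTS[ats].format(token=quote(token, safe=""))


print(feed_url("greenhouse", "acme"))  # "acme" is a hypothetical token
```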
<h2 id="resolve-the-token-is-half-the-battle">&ldquo;Resolve the token&rdquo; is half the battle</h2>
<p>The naive assumption — <em>the token is the company name, and everyone&rsquo;s on one of the three</em> — is half
right. When I probed my initial wishlist, <strong>roughly half 404&rsquo;d everywhere</strong>: HashiCorp (now under
IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They&rsquo;re on a fourth or fifth
system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as
disabled config stubs. Verify before you trust a slug.</p>
<h2 id="dedup-without-a-database">Dedup without a database</h2>
<p>I didn&rsquo;t want to stand up Postgres just to remember which jobs I&rsquo;d already seen. n8n&rsquo;s <strong>Data Tables</strong>
handle it natively: a <code>seen_jobs</code> table, an <code>external_id</code> namespaced <code>{ats}:{company}:{id}</code>, and the
<code>rowNotExists</code> operation drops anything already recorded. State lives inside n8n, backed up with it.
Zero extra infrastructure.</p>
<p>The ordering matters: <strong>notify first, mark seen second.</strong> The insert only happens after the email
sends, so a failed send retries next run instead of silently swallowing a posting.</p>
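<p>The ordering can be sketched in Python. This is not the n8n workflow: a plain <code>set</code> stands in for the <code>seen_jobs</code> Data Table, and <code>send_email</code> is a hypothetical callable that raises when delivery fails.</p>

```python
from typing import Callable


def external_id(ats: str, company: str, job_id: str) -> str:
    """Namespace the ID so the same number on two boards never collides."""
    return f"{ats}:{company}:{job_id}"


def run_once(jobs: list, seen: set, send_email: Callable[[list], None]) -> list:
    """Notify about unseen jobs first; record them only after the send worked."""
    new_jobs = [job for job in jobs if job["external_id"] not in seen]
    if not new_jobs:
        return []
    send_email(new_jobs)  # if this raises, nothing below runs
    seen.update(job["external_id"] for job in new_jobs)
    return new_jobs
```

<p>If <code>send_email</code> raises, <code>seen</code> is left untouched, so the same postings come back as new on the next run.</p>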
<h2 id="the-location-filter-is-a-trap">The location filter is a trap</h2>
<p>My first version kept everything that wasn&rsquo;t explicitly US-based. The inbox filled with <em>&ldquo;Senior
Platform Engineer — Spain (Remote)&rdquo;</em> and <em>&quot;… — United Kingdom (Remote)&quot;</em>. Those aren&rsquo;t remote-for-me
— they&rsquo;re remote <em>if you live in Spain</em>. Useless from where I sit.</p>
<p>The fix was to invert the logic. Keep only three things:</p>
<ul>
<li>globally-remote / worldwide / anywhere,</li>
<li>pan-EU (EMEA / Europe / EU / EEA),</li>
<li>my own country.</li>
</ul>
<p>…and <strong>drop single-country remote</strong>, even EU ones. Region and home matches win over the country
deny-list, and ambiguous locations are kept (a missed match is worse than one extra line to skim). That
one change cut the noise more than anything else.</p>
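<p>A minimal Python sketch of that precedence. The term lists are illustrative and <code>portugal</code> is a hypothetical home country; the real lists live in the workflow&rsquo;s config and may match differently.</p>

```python
import re

GLOBAL_TERMS = ("worldwide", "anywhere", "global", "globally")
REGION_TERMS = ("emea", "europe", "eu", "eea")
HOME_TERMS = ("portugal",)  # hypothetical home country
DENY_COUNTRIES = ("united states", "usa", "spain", "united kingdom", "germany")


def _has(text: str, terms: tuple) -> bool:
    """True if any term appears as a whole word or phrase in text."""
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


def keep_location(location: str) -> bool:
    text = location.lower()
    # Allow-list first: global, pan-EU, and home matches win over the deny-list.
    if _has(text, GLOBAL_TERMS) or _has(text, REGION_TERMS) or _has(text, HOME_TERMS):
        return True
    # Single-country remote somewhere else: drop.
    if _has(text, DENY_COUNTRIES):
        return False
    # Ambiguous (for example just "Remote"): keep rather than risk a miss.
    return True


print(keep_location("Spain (Remote)"))  # False
print(keep_location("Remote, EMEA"))    # True
```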
<h2 id="let-an-llm-read-the-actual-job">Let an LLM read the actual job</h2>
<p>Keyword + location filtering gets you a candidate list, but it can&rsquo;t tell a &ldquo;Platform Engineer&rdquo; who
herds Kubernetes from a &ldquo;Platform Engineer&rdquo; who owns a Figma design system. The job description can.</p>
<p>So the last step scores each new posting against my CV. My first version batched all of them into
<strong>one</strong> big LLM call — which promptly timed out on the free tier. The fix was the opposite: <strong>one
small call per job</strong>, which also means a single slow or rate-limited job never sinks the batch. Each
call asks an <a href="https://build.nvidia.com">NVIDIA NIM</a> model (Llama 3.1 8B, OpenAI-compatible) for one
number and a reason:</p>
<blockquote>
<p>Score this job 0–100 for fit against my profile. Return <code>{score, reason}</code>.</p>
</blockquote>
<p>That score is what lets me <strong>widen the net instead of narrowing it.</strong> On top of the curated company
list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location
filters do the first pass, then I <strong>only email the roles scoring 80%+.</strong> Casting wide is fine when a
model is the bar at the door. A line ends up looking like:</p>
<blockquote>
<p><strong>92%</strong> — <em>Grafana Labs</em> — Senior Platform Engineer (Remote, EMEA) — <em>strong k8s/GitOps overlap</em> — link</p>
</blockquote>
<p>Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen
either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.</p>
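<p>The per-job, fail-safe loop might look like this in Python. It is a sketch under assumptions: <code>score_one</code> is a hypothetical stand-in for the model call and is expected to return the <code>{score, reason}</code> object from the prompt, and the stub scorer exists only to make the example runnable.</p>

```python
from typing import Callable

THRESHOLD = 80  # only roles scoring 80%+ get emailed


def score_jobs(jobs: list, score_one: Callable[[dict], dict]) -> tuple:
    """Score each job in its own call; one failing call never sinks the rest."""
    matches, skipped = [], []
    for job in jobs:
        try:
            result = score_one(job)
            score = int(result["score"])
        except Exception:
            skipped.append(job)  # timeout, rate limit, malformed reply
            continue
        if score >= THRESHOLD:
            matches.append({**job, "score": score, "reason": result.get("reason", "")})
    return matches, skipped


def stub_scorer(job: dict) -> dict:
    """Fake model: fails on one title, otherwise echoes a preset score."""
    if job["title"] == "Flaky":
        raise TimeoutError("model did not answer")
    return {"score": job["preset"], "reason": "stub"}


jobs = [
    {"title": "Senior Platform Engineer", "preset": 92},
    {"title": "Design Systems Engineer", "preset": 35},
    {"title": "Flaky", "preset": 0},
]
matches, skipped = score_jobs(jobs, stub_scorer)
print([m["title"] for m in matches], len(skipped))
```

<p>The caller then marks every job in the batch as seen, matches and skips alike, which is what stops a failed posting from being re-scored forever.</p>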
<h2 id="the-unglamorous-bits-that-make-it-trustworthy">The unglamorous bits that make it trustworthy</h2>
<ul>
<li><strong>One bad source can&rsquo;t kill the run</strong> — every fetch is wrapped; failures become a <code>⚠️ N sources failing</code> footer so a company quietly changing ATS is visible, not invisible.</li>
<li><strong>A prime run</strong> seeds the table silently the first time, so I&rsquo;m not buried under every currently-open
role on day one.</li>
<li><strong>Everything tunable lives in one Config node</strong> — companies, keywords, location lists, the profile,
the model — so adding a company is a one-line edit, not a graph safari.</li>
</ul>
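<p>The first bullet, sketched in Python. <code>fetch</code> is a hypothetical per-source function that raises on any error; the footer text mirrors the one quoted above.</p>

```python
from typing import Callable


def fetch_all(sources: list, fetch: Callable[[str], list]) -> tuple:
    """Fetch every source independently; collect failures instead of aborting."""
    jobs, failed = [], []
    for name in sources:
        try:
            jobs.extend(fetch(name))
        except Exception:
            failed.append(name)  # e.g. the company moved to another ATS
    footer = f"⚠️ {len(failed)} sources failing" if failed else ""
    return jobs, footer
```

<p>Keeping the list of failed names around (rather than only the count) would also let the footer say which company to re-check.</p>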
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>The &ldquo;scrape job boards&rdquo; problem mostly isn&rsquo;t a scraping problem — it&rsquo;s three public APIs and a
normalizer.</li>
<li>For personal automation, reach for the boring-but-correct primitive: native dedup state beats a
database you have to operate.</li>
<li>An LLM works best here as the <strong>bar at the door</strong>: cheap deterministic filters keep the candidate
set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net
without drowning in it.</li>
</ul>
<p>Workflow JSON, the full node-by-node breakdown, and setup notes:
<strong><a href="https://github.com/janos-gyorgy/ats-job-poller">github.com/janos-gyorgy/ats-job-poller</a></strong>.</p>
]]></content:encoded></item><item><title>📦 Five Ways to Manage Kubernetes Manifests (and Why They're Not All Equal)</title><link>https://blog.hippotion.com/posts/gitops-manifest-approaches/</link><pubDate>Fri, 10 Oct 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/gitops-manifest-approaches/</guid><description>Raw YAML, Kustomize, Helm, Jsonnet — there&amp;rsquo;s more than one way to describe what you want running in a cluster. Here&amp;rsquo;s what each actually looks like in practice and where each one breaks.</description><content:encoded><![CDATA[<h2 id="the-problem-everyone-hits">The problem everyone hits</h2>
<p>You&rsquo;ve got a Kubernetes cluster. Now you need to describe what should run in it. You write some YAML, apply it, it works.</p>
<p>Then you need a second environment. Or a second service. Or someone else joins the project and asks &ldquo;how do I add an app to this?&rdquo; and you don&rsquo;t have a good answer.</p>
<p>This is the manifest management problem, and there are five common solutions — ranging from &ldquo;this works until it doesn&rsquo;t&rdquo; to &ldquo;this is what production platforms actually look like.&rdquo;</p>
<hr>
<h2 id="approach-1-raw-manifests">Approach 1: Raw manifests</h2>
<p>The starting point for almost everyone. Write a YAML file, <code>kubectl apply -f</code>, done.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">matchLabels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">template</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">myapp:v1.2.3</span><span class="w">
</span></span></span></code></pre></div><p><strong>Where it works:</strong> one service, one environment, learning Kubernetes. The feedback loop is immediate — write YAML, see what happens.</p>
<p><strong>Where it breaks:</strong></p>
<ul>
<li><strong>No templating.</strong> Want to change the image tag across ten services? Ten files, ten edits, ten chances to get it wrong.</li>
<li><strong>Live state leaks in.</strong> If you export existing resources with <code>kubectl get -o yaml</code>, you get <code>resourceVersion</code>, <code>generation</code>, <code>creationTimestamp</code>, and <code>managedFields</code> in the output. Commit that to Git and you&rsquo;ve created a permanent source of conflicts: Argo CD compares what&rsquo;s in Git against what&rsquo;s in the cluster, sees stale version counters, and the diff never clears.</li>
<li><strong>Copy-paste hell.</strong> A Deployment, a Service, an IngressRoute, a ServiceAccount, a NetworkPolicy — five files per app. Add a new app, copy five files, change the names, forget to update one. This is how environments drift apart silently.</li>
</ul>
<p>The fix for the live-state problem is to commit only desired state: strip every field that Kubernetes manages internally, so that only the clean spec remains. It&rsquo;s tedious and easy to forget, which is exactly why people move on from raw manifests.</p>
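<p>That stripping step can be scripted. A minimal Python sketch, assuming the manifest has already been parsed into a dict (for example from <code>kubectl get -o json</code>). The field list covers the four fields named above; <code>uid</code> and the top-level <code>status</code> block are additions of mine, since the API server fills those in too.</p>

```python
from copy import deepcopy

# The four fields named above, plus uid (my addition).
SERVER_MANAGED_METADATA = (
    "resourceVersion",
    "generation",
    "creationTimestamp",
    "managedFields",
    "uid",
)


def strip_live_state(manifest: dict) -> dict:
    """Return a copy holding only desired state, safe to commit to Git."""
    clean = deepcopy(manifest)
    metadata = clean.get("metadata", {})
    for field in SERVER_MANAGED_METADATA:
        metadata.pop(field, None)
    clean.pop("status", None)  # observed state, never desired state
    return clean
```

<p>The function leaves the input untouched, so it can be run over an export and the result diffed against the original before committing.</p>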
<hr>
<h2 id="approach-2-kustomize">Approach 2: Kustomize</h2>
<p>Kustomize is built into <code>kubectl</code> (<code>kubectl apply -k</code>) and natively supported by Argo CD. The idea: you have a <code>base/</code> with your raw manifests, and overlays that patch on top of them for different environments.</p>
<pre tabindex="0"><code>app/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml    # patches replicas to 1, image to :staging
    └── production/
        └── kustomization.yaml    # patches replicas to 3, image to :v1.2.3
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># overlays/production/kustomization.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">../../base</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">patches</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">patch</span><span class="p">:</span><span class="w"> </span><span class="p">|-</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">      - op: replace
</span></span></span><span class="line"><span class="cl"><span class="sd">        path: /spec/replicas
</span></span></span><span class="line"><span class="cl"><span class="sd">        value: 3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">target</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span></code></pre></div><p><strong>Where it works:</strong> multi-environment setups where the difference between environments is mostly configuration values, not structure. Kustomize is good at this — you write the base once and patch only what differs.</p>
<p><strong>Where it breaks:</strong></p>
<ul>
<li><strong>No real parameterization.</strong> Kustomize patches are surgical edits, not templates. If your base structure needs to vary (different resource shapes per environment, conditional blocks), you&rsquo;re fighting the tool.</li>
<li><strong>Patching deep structures is ugly.</strong> JSON patches on nested YAML are verbose and hard to read. You end up writing more patch YAML than it would take to just copy the file.</li>
<li><strong>Still repetitive across apps.</strong> Each app still gets its own base directory. You&rsquo;re not abstracting the shared patterns across apps, only the differences between environments of the same app.</li>
</ul>
<p>Kustomize is a significant step up from raw manifests for multi-environment setups. For complex templating or platform-level abstractions, it runs out of power quickly.</p>
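<p>To make &ldquo;surgical edits, not templates&rdquo; concrete, here is a toy Python version of what a JSON patch <code>replace</code> operation does to a parsed manifest. It is an illustration only, not Kustomize&rsquo;s code: it handles just <code>replace</code> and skips the <code>~0</code>/<code>~1</code> escaping that JSON Pointer defines.</p>

```python
from copy import deepcopy


def json_patch_replace(doc: dict, path: str, value):
    """Return a copy of doc with the value at a JSON Pointer path replaced."""
    patched = deepcopy(doc)
    *parents, leaf = path.split("/")[1:]
    node = patched
    for key in parents:
        # List elements are addressed by index, mappings by key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    if isinstance(node, list):
        node[int(leaf)] = value
    elif leaf in node:
        node[leaf] = value
    else:
        raise KeyError(f"replace target does not exist: {path}")
    return patched


base = {
    "kind": "Deployment",
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{"name": "myapp", "image": "myapp:staging"}]}},
    },
}
production = json_patch_replace(base, "/spec/replicas", 3)
print(production["spec"]["replicas"])  # 3
```

<p>Changing the image instead needs the path <code>/spec/template/spec/containers/0/image</code>, which shows how quickly patches on nested structures get long.</p>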
<hr>
<h2 id="approach-3-helm">Approach 3: Helm</h2>
<p>Helm adds real templating. Charts are parameterized bundles — templates with variables, conditionals, and loops — and values files supply the parameters.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># templates/deployment.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.name }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Release.Namespace }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.replicas | default 1 }}</span><span class="w">
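</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">matchLabels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.name }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">template</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.name }}</span><span class="w">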
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.name }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.image.repository }}:{{ .Values.image.tag }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>{{- <span class="l">if .Values.resources }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">resources</span><span class="p">:</span><span class="w"> </span>{{<span class="w"> </span><span class="l">.Values.resources | toYaml | nindent 12 }}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>{{- <span class="l">end }}</span><span class="w">
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># values-production.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">image</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">repository</span><span class="p">:</span><span class="w"> </span><span class="l">myorg/myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tag</span><span class="p">:</span><span class="w"> </span><span class="l">v1.2.3</span><span class="w">
</span></span></span></code></pre></div><p>Helm renders the templates at deploy time. What lands in the cluster is clean rendered YAML — no internal state, no conflicts.</p>
<p><strong>Where it works:</strong> almost everywhere. Artifact Hub (which replaced Helm Hub) already lists charts for most common software. For custom apps, writing a chart once and parameterizing it per environment is straightforwardly better than copying YAML.</p>
<p><strong>Where it breaks:</strong></p>
<ul>
<li><strong>Chart authoring is verbose.</strong> Writing a Helm chart from scratch involves a lot of Go templating boilerplate. For a simple app, it can feel like more scaffolding than application.</li>
<li><strong>Debugging rendered output is annoying.</strong> <code>helm template</code> is your friend, but errors in templates produce unhelpful messages. The indentation rules (<code>nindent</code>, <code>indent</code>, <code>toYaml</code>) have sharp edges.</li>
<li><strong>Values files still pile up.</strong> If every app has its own values file and there&rsquo;s no shared structure between them, you&rsquo;re back to copy-paste but now in YAML-that-configures-YAML.</li>
</ul>
<p>Helm is the right tool for most Kubernetes deployments. The ecosystem support alone (upstream charts for Postgres, Redis, Vault, every CNCF project) makes it the pragmatic default.</p>
<hr>
<h2 id="approach-4-jsonnet--cue">Approach 4: Jsonnet / CUE</h2>
<p>For teams that need programmatic config generation — actual code, not templates — Jsonnet and CUE are the serious alternatives.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-jsonnet" data-lang="jsonnet"><span class="line"><span class="cl"><span class="c1">// deployment.jsonnet
</span></span></span><span class="line"><span class="cl"><span class="k">local</span><span class="w"> </span><span class="nv">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">import</span><span class="w"> </span><span class="s">&#34;k.libsonnet&#34;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">local</span><span class="w"> </span><span class="nf">deployment</span><span class="p">(</span><span class="nv">name</span><span class="p">,</span><span class="w"> </span><span class="nv">image</span><span class="p">,</span><span class="w"> </span><span class="nv">replicas</span><span class="o">=</span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nv">k</span><span class="p">.</span><span class="nv">apps</span><span class="p">.</span><span class="nv">v1</span><span class="p">.</span><span class="nv">deployment</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="nv">name</span><span class="p">,</span><span class="w"> </span><span class="nv">replicas</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nv">k</span><span class="p">.</span><span class="nv">core</span><span class="p">.</span><span class="nv">v1</span><span class="p">.</span><span class="nv">container</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="nv">name</span><span class="p">,</span><span class="w"> </span><span class="nv">image</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nv">&#34;deployment.yaml&#34;</span><span class="p">:</span><span class="w"> </span><span class="nf">deployment</span><span class="p">(</span><span class="s">&#34;myapp&#34;</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;myorg/myapp:v1.2.3&#34;</span><span class="p">,</span><span class="w"> </span><span class="nv">replicas</span><span class="o">=</span><span class="mf">3</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p><strong>Where it works:</strong> large platforms where configuration is genuinely complex — many environments, many apps, deep interdependencies. Jsonnet lets you write real functions, share libraries, compose abstractions properly.</p>
<p><strong>Where it breaks:</strong></p>
<ul>
<li><strong>Steep learning curve.</strong> Jsonnet is a full language. CUE even more so — it has types, schemas, and a constraint system that takes time to internalise.</li>
<li><strong>Small community.</strong> Excellent tooling, but you&rsquo;re solving problems that have fewer Stack Overflow answers.</li>
<li><strong>Overkill for most setups.</strong> If you&rsquo;re not managing hundreds of services across multiple clusters, Helm is simpler and has everything you need.</li>
</ul>
<p>Jsonnet sees serious use in very large infrastructure teams and in some CNCF projects. For a homelab or a small-to-medium platform, it&rsquo;s the right answer to a question you probably aren&rsquo;t asking yet.</p>
<hr>
<h2 id="approach-5-app-of-apps-with-generated-application-crds">Approach 5: App-of-apps with generated Application CRDs</h2>
<p>This is the ArgoCD-native meta-layer. Instead of managing manifests, you manage <code>Application</code> resources — and potentially use a chart or tool to generate those too.</p>
<p>A naive version: commit a folder of <code>Application</code> YAML files to Git, one per service. ArgoCD watches the folder and deploys each app.</p>
<p>A more sophisticated version: one &ldquo;root app&rdquo; that points to a chart, which generates all the other <code>Application</code> resources dynamically from a single config file.</p>
<p><strong>Where it works:</strong> at the platform level, not the individual app level. App-of-apps is how you manage what ArgoCD manages, not how you write the service manifests themselves. Combined with Helm, it gives you centralised control over the entire cluster&rsquo;s structure.</p>
<p><strong>Where it breaks:</strong></p>
<ul>
<li><strong>Manual <code>Application</code> resources are painful.</strong> If you&rsquo;re maintaining a folder of hand-written <code>Application</code> YAML files, one per service, you&rsquo;ve traded manifest copy-paste for Application copy-paste. Each app needs its own <code>Application</code> resource with its repo URL, path, sync policy, and project reference.</li>
<li><strong>Sync ordering matters.</strong> The root app must exist before children can sync. Get the wave ordering wrong and apps try to deploy before their namespaces exist.</li>
</ul>
<hr>
<h2 id="how-this-homelab-compares">How this homelab compares</h2>
<p>My setup sits at the far end of approach 5, using Helm throughout.</p>
<p>There&rsquo;s a single <code>applications.yml</code> file that describes every service in the cluster. A root Helm chart reads it and generates all the ArgoCD <code>Application</code> and <code>AppProject</code> resources automatically. Adding a service means adding an entry to that file, not touching five different places across five different files.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># applications.yml — this is the entire service catalog</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-vaultwarden</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">profile</span><span class="p">:</span><span class="w"> </span><span class="l">web-app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-vaultwarden</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><p>That one entry generates: a Namespace, an ArgoCD AppProject, an ArgoCD Application, a set of Cilium NetworkPolicies (deny-all with ingress from Traefik and DNS/HTTPS egress), and a ServiceAccount. Nothing is written by hand.</p>
<p>The actual service manifests live in an <code>extra-objects</code> chart — a thin wrapper that renders raw YAML from values files. No templating in the service manifests themselves (they&rsquo;re simple enough not to need it), but the infrastructure scaffolding around each app is entirely generated.</p>
<p>The result: every service gets the same operational properties. Same GitOps workflow, same secret management, same network isolation, same TLS termination. The platform work was done once. Adding a new app is writing manifests for the app&rsquo;s specific behavior, not recreating the scaffolding.</p>
<hr>
<h2 id="the-honest-spectrum">The honest spectrum</h2>
<table>
	<thead>
			<tr>
					<th>Approach</th>
					<th>Templating</th>
					<th>Abstraction</th>
					<th>Ecosystem</th>
					<th>Complexity</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Raw manifests</td>
					<td>None</td>
					<td>None</td>
					<td>None</td>
					<td>Low</td>
			</tr>
			<tr>
					<td>Kustomize</td>
					<td>Patches only</td>
					<td>Overlays</td>
					<td>Medium</td>
					<td>Low-medium</td>
			</tr>
			<tr>
					<td>Helm</td>
					<td>Full</td>
					<td>Per-chart</td>
					<td>Large</td>
					<td>Medium</td>
			</tr>
			<tr>
					<td>Jsonnet/CUE</td>
					<td>Full + typed</td>
					<td>Libraries</td>
					<td>Small</td>
					<td>High</td>
			</tr>
			<tr>
					<td>App-of-apps</td>
					<td>Depends</td>
					<td>Platform-level</td>
					<td>ArgoCD-native</td>
					<td>High</td>
			</tr>
	</tbody>
</table>
<p>Most setups should start at Helm. Kustomize if you&rsquo;re multi-environment and comfortable with patching. App-of-apps when you&rsquo;re managing the platform layer, not individual services. Jsonnet/CUE when you know you&rsquo;ve outgrown Helm — which is a specific and relatively rare problem to have.</p>
<p>Raw manifests are fine for learning. They&rsquo;re the wrong answer for anything you intend to maintain.</p>
<hr>
<p><em>More on how the homelab is structured: <a href="/posts/homelab-gitops/">My Homelab Runs on GitOps</a>.</em></p>
]]></content:encoded></item><item><title>🚨 Don't Restart the Node. Quarantine It First.</title><link>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</guid><description>Rebooting a misbehaving node feels productive. It isn&amp;rsquo;t. You&amp;rsquo;re erasing your evidence and skipping the lesson.</description><content:encoded><![CDATA[<h2 id="the-reflex">The reflex</h2>
<p>Something&rsquo;s wrong. A GitLab runner stops picking up jobs. An event processor starts dropping messages. A pod restarts in a loop. The node looks healthy — CPU fine, memory fine — but something is clearly off.</p>
<p>The reflex: restart the node, see if it clears.</p>
<p>Sometimes it does clear, and you move on. But you didn&rsquo;t fix anything. You reset the state and crossed your fingers. If it happens again in two weeks, you&rsquo;ll do the same thing. After enough iterations you have a &ldquo;flaky node&rdquo; that everyone reboots periodically and nobody understands.</p>
<p>There&rsquo;s a better sequence. It takes twenty minutes instead of two, and you come out with either a real fix or actual knowledge of what happened.</p>
<hr>
<h2 id="step-one-quarantine-dont-kill">Step one: quarantine, don&rsquo;t kill</h2>
<p>Before you touch anything, take the node out of rotation without destroying its current state.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl cordon &lt;node&gt;
</span></span></code></pre></div><p>Cordon marks the node as unschedulable. No new pods land on it. Existing pods keep running. If you need the workloads somewhere else immediately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl drain &lt;node&gt; --ignore-daemonsets --delete-emptydir-data
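</span></span></code></pre></div><p>Either way, the node is out of production traffic without a reboot, and it is still alive. With a cordon alone, everything that happened on it is still there: logs, open files, kernel ring buffer, running processes, memory state. A drain evicts the pods, so their processes and <code>emptyDir</code> data go with them (that is what <code>--delete-emptydir-data</code> acknowledges), but node-level evidence such as the kernel log and the system journal survives.</p>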
<p>This is the difference. A reboot wipes that. A cordon preserves it.</p>
<hr>
<h2 id="step-two-look-at-whats-actually-there">Step two: look at what&rsquo;s actually there</h2>
<p>SSH in. Don&rsquo;t grep for anything specific yet — do a pass for anything unusual.</p>
<p><strong>Kernel messages first.</strong> The kernel will often tell you exactly what went wrong before any application did.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">dmesg -T --level<span class="o">=</span>err,warn <span class="p">|</span> tail -50
</span></span></code></pre></div><p>OOM kills show up here. Disk errors show up here. CPU soft lockups show up here. If you&rsquo;ve got any of those, you have your answer before you&rsquo;ve even looked at application logs.</p>
<p><strong>Check for filesystem problems.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">df -h          <span class="c1"># is anything full?</span>
</span></span><span class="line"><span class="cl">dmesg <span class="p">|</span> grep -i <span class="s2">&#34;ext4\|xfs\|btrfs\|i/o error\|ata&#34;</span>
</span></span></code></pre></div><p>A filesystem at 100% is silent until it isn&rsquo;t. A flaky drive often starts logging I/O errors to dmesg before SMART reports anything. Application developers rarely think about this case: their app just starts writing logs that say &ldquo;failed to write&rdquo; without specifying that the disk is full or dying.</p>
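<p>The &ldquo;is anything full?&rdquo; pass can be made less eyeball-dependent by filtering <code>df</code> output against a threshold. The helper below is a minimal sketch, not part of the original workflow: the function name <code>full_filesystems</code> and the 90% default are assumptions, and it expects POSIX-format <code>df -P</code> output on stdin.</p>

```bash
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` style output on stdin, so it works on live or saved output.
full_filesystems() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                 # skip the header row
      use = $5               # capacity column, e.g. "93%"
      sub(/%/, "", use)      # strip the percent sign
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Usage (commented out so that sourcing this file has no side effects):
# df -P  | full_filesystems 90    # disk space
# df -Pi | full_filesystems 90    # inodes (GNU df)
```

<p>Running it against the inode view as well catches filesystems that are out of inodes while still showing free space. Mount points that contain spaces would need more careful parsing than this sketch does.</p>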
<p><strong>System resource pressure.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">vmstat <span class="m">1</span> <span class="m">5</span>          <span class="c1"># is there swap activity?</span>
</span></span><span class="line"><span class="cl">iostat -x <span class="m">1</span> <span class="m">5</span>       <span class="c1"># is a disk saturated?</span>
</span></span><span class="line"><span class="cl">cat /proc/pressure/io   <span class="c1"># kernel PSI — pressure stall info</span>
</span></span></code></pre></div><p>PSI is underused. It tells you whether processes were actually stalled waiting for I/O, not just whether throughput was high. A disk at 80% utilisation might be fine; a disk with 40% I/O PSI pressure is actively hurting performance.</p>
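<p>To turn the PSI file into a single number you can compare or alert on, something like the following works. <code>/proc/pressure/io</code> holds two lines, <code>some</code> (at least one task was stalled on I/O) and <code>full</code> (all non-idle tasks were stalled at once), each with percentages averaged over 10, 60 and 300 seconds. The function name <code>psi_value</code> is an assumption made for this sketch, and PSI itself needs a kernel built with it (4.20 or later) and, on some distributions, enabled at boot.</p>

```bash
# Extract one PSI average from /proc/pressure/<resource> style input.
# $1: "some" or "full"; $2: "avg10", "avg60" or "avg300". Reads stdin.
psi_value() {
  local kind="$1" field="$2"
  awk -v kind="$kind" -v field="$field" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")             # "avg10=0.42" -> kv[1], kv[2]
        if (kv[1] == field) print kv[2]
      }
    }'
}

# Usage (commented out; requires PSI support on the node):
# psi_value some avg10 < /proc/pressure/io
# psi_value full avg60 < /proc/pressure/memory
```

<p>A <code>some avg10</code> value of 40 means that for roughly 40% of the last ten seconds at least one task was waiting on I/O, which is the stall the paragraph above describes.</p>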
<p><strong>What were the pods doing right before things went sideways?</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl describe node &lt;node&gt;    <span class="c1"># events section at the bottom</span>
</span></span><span class="line"><span class="cl">kubectl get events -A --field-selector involvedObject.kind<span class="o">=</span>Pod --sort-by<span class="o">=</span>.lastTimestamp
</span></span></code></pre></div><p>Look for OOMKilled exits, failed liveness probes, and throttling events. Kubernetes events expire after an hour by default — another reason not to reboot immediately; those events are still there if you look now.</p>
<hr>
<h2 id="a-real-example-the-gitlab-runner">A real example: the GitLab runner</h2>
<p>A GitLab runner pod stops picking up jobs. It looks alive — the process is running, no crashes in the pod logs. Jobs sit in the queue.</p>
<p>Restart reflex: delete the pod, let it reschedule, it picks up jobs again.</p>
<p>But why did it stop?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl -u gitlab-runner --since <span class="s2">&#34;1 hour ago&#34;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># or, if it&#39;s a container:</span>
</span></span><span class="line"><span class="cl">kubectl logs &lt;runner-pod&gt; --since<span class="o">=</span>1h
</span></span></code></pre></div><p>In one instance: the runner&rsquo;s working directory was on a tmpfs that hit its size limit. The runner silently failed to create job workspaces and stopped accepting new jobs. The error was one line in the pod logs: <code>mkdir /builds: no space left on device</code>. The pod was healthy by every other metric.</p>
<p>Fix: bump the tmpfs size limit in the runner config. The restart would have cleared tmpfs temporarily, and the runner would have failed again the next time a large job filled it up.</p>
<p>The debug took five minutes. The permanent fix took two minutes. With a reflexive restart, the evidence would have been gone.</p>
<hr>
<h2 id="another-one-the-event-consumer">Another one: the event consumer</h2>
<p>An event processor starts falling behind. Messages queue up. The pod shows no errors. Memory looks fine.</p>
<p>This one was subtler: the processor was connected to a downstream dependency over a persistent TCP connection. The connection had gone into a half-open state — the processor thought it was alive, the remote end had already dropped it. New messages were being sent into a dead socket and silently discarded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ss -tnp <span class="p">|</span> grep &lt;pid&gt;    <span class="c1"># look at the socket state</span>
</span></span></code></pre></div><p><code>CLOSE_WAIT</code> (printed by <code>ss</code> as <code>CLOSE-WAIT</code>) on a connection that should be <code>ESTABLISHED</code>: the remote end had closed its side and the application never noticed. It wasn&rsquo;t checking whether the connection was actually working before using it, just whether it existed.</p>
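<p>When a process holds many connections, grouping the stuck ones by remote endpoint shows quickly which dependency is involved. A minimal sketch, assuming the default column layout of <code>ss -tn</code> (state first, peer address fifth); the function name <code>close_wait_by_peer</code> is made up for this example.</p>

```bash
# Count sockets in CLOSE-WAIT per remote endpoint.
# Reads `ss -tn` output on stdin; prints "<count> <peer address:port>",
# busiest peer first.
close_wait_by_peer() {
  awk '$1 == "CLOSE-WAIT" { count[$5]++ }   # state column as printed by ss
       END { for (peer in count) print count[peer], peer }' | sort -rn
}

# Usage (commented out; run on the node or inside the pod's network namespace):
# ss -tn | close_wait_by_peer
```

<p>A steadily growing count against one peer points at an application that is not closing or re-establishing connections after the other side hangs up.</p>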
<p>Restart would have cleared the socket state, fixed the symptom, and left the bug in the code.</p>
<hr>
<h2 id="what-to-look-for--a-short-checklist">What to look for — a short checklist</h2>
<p>When a node is misbehaving, in order:</p>
<ol>
<li><code>dmesg -T --level=err,warn</code> — kernel errors, OOM kills, disk errors</li>
<li><code>df -h &amp;&amp; df -i</code> — full filesystems (space and inodes separately)</li>
<li><code>kubectl describe node &lt;node&gt;</code> — pressure conditions, recent events</li>
<li><code>kubectl logs &lt;pod&gt;</code> (add <code>--previous</code> if the container restarted): what the pod logged before it died or got stuck</li>
<li><code>ss -tnp</code> — socket states for network-adjacent issues</li>
<li><code>vmstat 1 5</code> + <code>iostat -x 1 5</code> — resource pressure</li>
<li><code>journalctl -p err -b</code> — system journal errors since last boot</li>
</ol>
<p>Most problems show up in the first three.</p>
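<p>The node-local items on that checklist can be captured in one pass, before anyone is tempted to restart something. This is a minimal sketch, assuming bash on the node: <code>capture</code>, <code>collect_node_evidence</code> and <code>EVIDENCE_DIR</code> are hypothetical names, and tools such as <code>iostat</code> may not be installed everywhere (a missing or failing command is recorded, not fatal). The <code>kubectl</code> items are left out because they usually run from a workstation rather than on the node.</p>

```bash
# Run one command and save its stdout and stderr to $EVIDENCE_DIR/<name>.txt.
# A failing or missing command is noted in the file instead of aborting.
capture() {
  local name="$1"; shift
  mkdir -p "$EVIDENCE_DIR"
  {
    echo "# $*"                               # record the command itself
    "$@" 2>&1 || echo "# exit status: $?"     # record failures too
  } > "$EVIDENCE_DIR/$name.txt"
}

# Collect the node-local checklist items, in checklist order.
collect_node_evidence() {
  capture dmesg     dmesg -T --level=err,warn
  capture df-space  df -h
  capture df-inodes df -i
  capture sockets   ss -tnp
  capture vmstat    vmstat 1 5
  capture iostat    iostat -x 1 5
  capture journal   journalctl -p err -b --no-pager
}

# Usage (commented out; the directory name is only a suggestion):
# EVIDENCE_DIR="/var/tmp/evidence-$(date +%Y%m%dT%H%M%S)" collect_node_evidence
```

<p>Writing each command to its own file keeps the snapshot readable later and makes it easy to diff against the next incident.</p>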
<hr>
<h2 id="after-youve-found-something-or-not-found-something">After you&rsquo;ve found something (or not found something)</h2>
<p><strong>If you found the cause:</strong> fix it, test it, uncordon the node.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl uncordon &lt;node&gt;
</span></span></code></pre></div><p>Document what you found — a comment in the relevant config, a commit message, a note. &ldquo;Fixed runner tmpfs limit&rdquo; in the commit history is more useful than &ldquo;flaky runner, restarted.&rdquo;</p>
<p><strong>If you genuinely found nothing:</strong> that&rsquo;s information too. Reboot the already-cordoned node, uncordon it, and note that it came back clean with no identified cause. If it happens again, you have a pattern. Check whether anything changed in the workloads around that time. Check whether the timing of the failure correlates with anything: cron jobs, backups, maintenance windows.</p>
<p>A reboot you can explain is a fix. A reboot you can&rsquo;t explain is a time bomb.</p>
<hr>
<h2 id="why-this-matters-on-a-single-node-cluster">Why this matters on a single-node cluster</h2>
<p>In a multi-node setup you can afford to be lazier — cordon, drain, reboot, let the scheduler handle it, look at it later. On a single node there&rsquo;s no &ldquo;later.&rdquo; The node coming back is all you&rsquo;ve got.</p>
<p>But the habit is worth building regardless of node count. The engineers who understand their systems are the ones who looked before they rebooted.</p>
<hr>
<h2 id="the-actual-rule">The actual rule</h2>
<p><strong>Quarantine first. Debug second. Restart third (if you still need to).</strong></p>
<p>A restart takes two minutes. The evidence it destroys might take two hours to reconstruct — or might be gone for good. The cordon costs you nothing.</p>
]]></content:encoded></item><item><title>🏗️ My Homelab Runs on GitOps. Here's What That Actually Means.</title><link>https://blog.hippotion.com/posts/homelab-gitops/</link><pubDate>Fri, 28 Mar 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/homelab-gitops/</guid><description>I wanted to learn production-grade Kubernetes patterns without breaking production. One node, a full GitOps stack, and a hard rule: no manual kubectl after bootstrap.</description><content:encoded><![CDATA[<h2 id="why-this-exists">Why this exists</h2>
<p>I&rsquo;ve been working in DevOps and platform engineering long enough to know what I don&rsquo;t know. The patterns that separate robust infrastructure from &ldquo;it works on my machine&rdquo; infrastructure — GitOps, admission policies, network segmentation, secrets management — are easy to read about. They&rsquo;re harder to actually internalise without running them yourself.</p>
<p>So I built a homelab. An old ThinkCentre I had sitting around, k3s, and a rule I set for myself before writing a single line of configuration: <strong>GitLab is the only source of truth. No manual <code>kubectl</code> after bootstrap. All changes go through <code>git push</code>.</strong></p>
<p>That rule turned out to be more consequential than I expected.</p>
<hr>
<h2 id="the-stack">The stack</h2>
<p>The cluster runs about thirty services across two categories: infrastructure that makes the platform work, and applications that actually do things.</p>
<p>Infrastructure:</p>
<ul>
<li><strong>k3s</strong> — lightweight Kubernetes, single-node</li>
<li><strong>Cilium</strong> — CNI with NetworkPolicy support (Flannel, k3s&rsquo;s default, does not enforce NetworkPolicies itself; stock k3s relies on a separate embedded controller for that)</li>
<li><strong>Argo CD</strong> — GitOps reconciler, watches the repo, applies changes</li>
<li><strong>Traefik</strong> — ingress controller, two entrypoints</li>
<li><strong>Cloudflare tunnel</strong> — external access without open ports</li>
<li><strong>cert-manager</strong> — wildcard TLS cert via Let&rsquo;s Encrypt DNS-01</li>
<li><strong>oauth2-proxy</strong> — GitLab SSO protecting everything by default</li>
<li><strong>Vault + External Secrets Operator</strong> — secrets management</li>
<li><strong>Pi-hole</strong> — local DNS for <code>*.hippotion.com</code></li>
</ul>
<p>Applications: a media server (Jellyfin, *arr stack), Immich for photos, Vaultwarden for passwords, Home Assistant, n8n for automation, a Hugo blog, Obsidian via browser-based KasmVNC, and a few custom-built things I&rsquo;ll get to below.</p>
<hr>
<h2 id="traffic-reaches-the-cluster-in-two-ways">Traffic reaches the cluster in two ways</h2>
<p>External traffic (from anywhere on the internet) goes through a Cloudflare tunnel. The cloudflared pod dials out to Cloudflare, so there are no open ports on the server, no firewall rules, no exposed IP. Cloudflare terminates TLS at its edge and handles the certificate for external visitors; cloudflared then forwards plain HTTP to Traefik on port 7080.</p>
<p>Local traffic (home WiFi) goes through Pi-hole, which resolves <code>*.hippotion.com</code> to the server&rsquo;s LAN IP. Traefik receives HTTPS on port 443, served with a wildcard certificate that cert-manager issues from Let&rsquo;s Encrypt via DNS-01 challenge. Port 80 redirects to 443; the <code>cloudflare</code> entrypoint on 7080 does not redirect, because it&rsquo;s already receiving plain HTTP from cloudflared.</p>
<p>The result: the same IngressRoute handles both paths.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">entryPoints</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">cloudflare  </span><span class="w"> </span><span class="c"># plain HTTP from the cloudflared pod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">websecure   </span><span class="w"> </span><span class="c"># local HTTPS with wildcard cert</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">routes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span><span class="l">Host(`myapp.hippotion.com`)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Rule</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">middlewares</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">oauth-auth</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">sys-oauth2-gitlab</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="m">8080</span><span class="w">
</span></span></span></code></pre></div><p>Every IngressRoute has both entrypoints. If you forget one, the service is unreachable from half your access paths. Learned that the first time I added an app and couldn&rsquo;t reach it from the phone.</p>
<hr>
<h2 id="one-file-generates-everything">One file generates everything</h2>
<p>The centrepiece of the setup is <code>applications.yml</code> — a single file that is the complete list of everything running in the cluster. Every entry generates a Namespace, an Argo CD AppProject, an Application, NetworkPolicies, and RBAC. Nothing is created anywhere else.</p>
<p>An entry looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-vaultwarden</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">profile</span><span class="p">:</span><span class="w"> </span><span class="l">web-app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-vaultwarden</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><p>Seven lines. That deploys a namespace, an Argo CD app that watches <code>helm-charts/extra-objects/values-web-vaultwarden.yml</code>, a full set of Cilium NetworkPolicies based on the <code>web-app</code> profile (deny-all, with ingress from Traefik and egress limited to DNS and external HTTPS), and a ServiceAccount. Adding a new service to the cluster means one entry in this file plus a values file with the actual Kubernetes manifests.</p>
<p>The <code>profile: web-app</code> notation deserves a word. Raw NetworkPolicy YAML is repetitive and error-prone — every namespace needs a deny-all base plus specific allows. I template it. A Helm chart maps profile names to concrete policy sets. <code>web-app</code> means: deny all ingress except from the ingress namespace, deny all egress except DNS and external HTTPS. <code>web-app-internal</code> means the same but no external egress — suitable for services that only talk to other in-cluster services. <code>media-server</code> adds port 6881 for BitTorrent. The policies are generated; no one writes them by hand.</p>
<hr>
<h2 id="secrets-without-storing-them-in-git">Secrets without storing them in Git</h2>
<p>Kubernetes <code>Secret</code> objects are not secrets. They&rsquo;re base64-encoded blobs in etcd, and base64 is not encryption. Committing them to a Git repo — even a private one — is the wrong answer.</p>
<p>The setup here uses HashiCorp Vault as the actual secret store, with External Secrets Operator syncing Vault paths to Kubernetes Secrets. What lives in Git is an <code>ExternalSecret</code> resource:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">external-secrets.io/v1beta1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ExternalSecret</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp-credentials</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">secretStoreRef</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">vault</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ClusterSecretStore</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">target</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">myapp-credentials</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">secretKey</span><span class="p">:</span><span class="w"> </span><span class="l">DB_PASSWORD</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">remoteRef</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">secret/myapp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">property</span><span class="p">:</span><span class="w"> </span><span class="l">db-password</span><span class="w">
</span></span></span></code></pre></div><p>This is safe to commit. It says where the secret lives, not what it is. Vault contains the actual value. ESO syncs it to the cluster and refreshes every hour. Rotation means updating the value in Vault — no Git commit, no deployment.</p>
<p>Vault runs in-cluster with a sidecar that auto-unseals on restart. Not production-grade (the unseal key is on the same PVC as Vault itself), but pragmatic for a homelab where availability matters more than a sophisticated key management ceremony.</p>
<hr>
<h2 id="three-things-i-built-that-were-worth-building">Three things I built that were worth building</h2>
<h3 id="local-ai-inference">Local AI inference</h3>
<p>The cluster runs a local LLM. The <code>web-ai-engine</code> namespace has Open WebUI fronting a llama-server serving Phi-3.5 Mini in GGUF format. The model file lives on the node&rsquo;s filesystem, mounted as a hostPath volume.</p>
<p><code>web-openclaw</code> is a personal AI assistant UI that can route requests to either external providers (via NVIDIA&rsquo;s API) or the local llama-server, depending on the task. The local model handles things that don&rsquo;t need to leave the house; the external API handles things that do. The network policy for <code>web-openclaw</code> explicitly allows egress to <code>web-ai-engine</code> and nowhere else for local inference.</p>
<p>Running a 3.8B-parameter model on homelab hardware is genuinely useful and costs nothing per query. It&rsquo;s not GPT-4, but for summarisation, first drafts, and things you don&rsquo;t want to send to a third-party API, it&rsquo;s more than good enough.</p>
<h3 id="brew-buddy">Brew Buddy</h3>
<p>I make kombucha. I was tracking fermentation batches in a notes app and getting annoyed at not being able to see history across batches. So I built a tracker.</p>
<p>Brew Buddy is a React frontend and a Go API backed by PostgreSQL, all running in the <code>web-brew-buddy</code> namespace. The images are built locally and imported into the cluster&rsquo;s container runtime with <code>k3s ctr images import</code>. It&rsquo;s deployed like any other app — a values file, an entry in <code>applications.yml</code>, a Vault secret for the database password.</p>
<p>The point isn&rsquo;t the app. The point is that the platform handles a custom hobby project with the same operational properties as Vaultwarden or Immich. Same GitOps workflow, same secret management, same network isolation, same TLS termination. Adding an app to this cluster takes an afternoon of writing manifests and a few seconds of git push. The platform work was done once.</p>
<h3 id="qr-device-login">QR device login</h3>
<p>This one has <a href="/posts/qr-device-login/">its own post</a> because it took three days and four complete rewrites of oauth2-proxy&rsquo;s session format to get right.</p>
<p>The short version: the Homer dashboard on the living room TV needed a way to log in without typing credentials on a TV keyboard. I built a device-flow OAuth service: phone scans QR, phone authenticates with GitLab, TV session is created. Ending the session from the phone kills the TV&rsquo;s session immediately by deleting the oauth2-proxy Redis ticket.</p>
<p>It&rsquo;s the most overengineered solution to a problem I have, and I don&rsquo;t regret a minute of it.</p>
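<p>The flow described above can be sketched in a few lines. This is a minimal illustration, not the actual service: it assumes an in-memory store standing in for Redis, and every name in it (<code>DeviceLogin</code>, <code>start</code>, <code>approve</code>, <code>poll</code>, <code>revoke</code>) is hypothetical.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import secrets


class DeviceLogin:
    """In-memory stand-in for the device-flow state (hypothetical names)."""

    def __init__(self):
        self.pending = {}   # device code: approved user, or None while waiting
        self.sessions = {}  # session ticket: user

    def start(self):
        # The TV asks for a code and renders it as a QR image.
        code = secrets.token_urlsafe(16)
        self.pending[code] = None
        return code

    def approve(self, code, user):
        # The phone scanned the QR and finished the GitLab login.
        if code not in self.pending:
            raise KeyError("unknown or expired device code")
        self.pending[code] = user

    def poll(self, code):
        # The TV polls; it gets a session ticket once the phone has approved.
        user = self.pending.get(code)
        if user is None:
            return None
        del self.pending[code]
        ticket = secrets.token_urlsafe(16)
        self.sessions[ticket] = user
        return ticket

    def revoke(self, ticket):
        # Deleting the ticket is what ends the TV session.
        self.sessions.pop(ticket, None)
</code></pre>
<p>The device code is single-use: once <code>poll</code> has exchanged it for a ticket, polling again returns nothing, so a photographed QR code cannot mint a second session.</p>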
<hr>
<h2 id="what-operating-this-way-actually-changes">What operating this way actually changes</h2>
<p>The practical difference the no-manual-kubectl rule makes is larger than it sounds.</p>
<p><strong>The audit trail is automatic.</strong> Every change to the cluster is a git commit with an author, a timestamp, and a diff. There&rsquo;s no &ldquo;what did I change last Tuesday?&rdquo; — I know exactly what changed last Tuesday, and I can revert it with <code>git revert</code>. The Argo CD UI shows the diff between what&rsquo;s in Git and what&rsquo;s running. If there&rsquo;s a diff, something went wrong.</p>
<p><strong>New services are cheap to add.</strong> The platform does the repetitive work — namespace, RBAC, network policies, TLS termination, OAuth protection. Adding a new app is writing the manifests and updating <code>applications.yml</code>. The infrastructure concerns are handled.</p>
<p><strong>Recovery is straightforward.</strong> If I rebuild the node (which I&rsquo;ve done), I run two bootstrap scripts, apply one Argo CD manifest, and the cluster reconciles itself from Git over the next few minutes. The only things that require manual work are the secrets that can&rsquo;t live in Git — two OAuth credentials and the Cloudflare tunnel token, all recreated by <code>scripts/create-secrets.sh</code>.</p>
<p><strong>Experimentation is safe.</strong> Things I&rsquo;m not sure I&rsquo;ll keep run as <code>toggleable: true</code> apps. Turning them off is removing the entry from <code>applications.yml</code> and pushing. Turning them back on is adding it back.</p>
<hr>
<h2 id="what-it-doesnt-solve">What it doesn&rsquo;t solve</h2>
<p>Bootstrap is manual. The first <code>kubectl apply -f argocd/root-app.yaml</code> happens outside of GitOps by definition. The three bootstrap secrets can&rsquo;t be in Git. This is unavoidable — you need to trust something before GitOps can take over, and that something is a short manual procedure.</p>
<p>Some things fight the model. k3s&rsquo;s built-in addon controller rewrites the metrics-server Deployment on every k3s restart, removing a patch needed for Cilium compatibility. The fix is a pod that watches for the revert and reapplies the patch. It works, but it&rsquo;s a workaround for a component I don&rsquo;t control.</p>
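<p>The watch-and-reapply idea reduces to one idempotent check. The sketch below works on a Deployment represented as a plain dictionary and leaves out the Kubernetes API calls entirely. The flag in <code>REQUIRED_ARG</code> is a placeholder, since the real patch is not shown here, and the function names are hypothetical.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import copy

REQUIRED_ARG = "--placeholder-flag"  # hypothetical; stands in for the real patch


def needs_patch(deployment):
    # True when the first container has lost the required argument.
    container = deployment["spec"]["template"]["spec"]["containers"][0]
    return REQUIRED_ARG not in container.get("args", [])


def apply_patch(deployment):
    # Return a patched copy; the input is left untouched.
    patched = copy.deepcopy(deployment)
    container = patched["spec"]["template"]["spec"]["containers"][0]
    container.setdefault("args", []).append(REQUIRED_ARG)
    return patched


def reconcile(deployment):
    # One pass of the watch loop: patch only if the addon controller reverted it.
    return apply_patch(deployment) if needs_patch(deployment) else deployment
</code></pre>
<p>Because <code>reconcile</code> returns its input unchanged when the argument is already present, running it after every change to the Deployment is harmless.</p>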
<p>Single-node means single point of failure. For a homelab, that&rsquo;s acceptable. For anything important, it&rsquo;s not.</p>
<hr>
<h2 id="the-honest-summary">The honest summary</h2>
<p>I set out to learn production-grade Kubernetes patterns, and I did. The GitOps constraint turned out to be the best engineering decision in the project — not because it made things easier in the short term (it didn&rsquo;t), but because it forced every change through a path that is auditable, reversible, and consistent.</p>
<p>The cluster is a single ThinkCentre running about thirty services, secured by Cilium network policies and authenticated via GitLab SSO. Secrets are managed by Vault, and all configuration lives in a Git repo that I could hand to someone tomorrow and they&rsquo;d understand what&rsquo;s running and why.</p>
<p>That&rsquo;s the goal. For a homelab, I&rsquo;ll call it achieved.</p>
]]></content:encoded></item><item><title>I Inherited a System With No Map. So I Drew Two.</title><link>https://blog.hippotion.com/posts/inherited-a-system-no-map/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/inherited-a-system-no-map/</guid><description>How I turned a tribal-knowledge handover into a two-track learning roadmap — one track for the technology, one for our system, designed to interleave.</description><content:encoded><![CDATA[<p>When I took over DevOps, the handover was a person, not a document. That person was leaving. Everything I&rsquo;d need to keep thirty-odd services and a fleet of customer servers alive lived in his head, in scattered runbooks, and in the muscle memory of having done it before. The classic shape: the system worked, and exactly one human knew why.</p>
<p>So the first real project wasn&rsquo;t a migration or a dashboard. It was writing down the system before the only other copy walked out the door.</p>
<p>The obvious move is to write <em>the docs</em> — one big knowledge base, ordered however the system happens to be wired. I tried that for about a day. It doesn&rsquo;t work, and the reason it doesn&rsquo;t work is the whole point of this post.</p>
<h2 id="the-two-questions-a-new-hire-is-actually-asking">The two questions a new hire is actually asking</h2>
<p>Watch someone learn an unfamiliar platform and you&rsquo;ll notice they&rsquo;re never confused about one thing. They&rsquo;re confused about two, and they&rsquo;re different kinds of confused.</p>
<p>The first is <strong>&ldquo;what is this technology?&rdquo;</strong> — what&rsquo;s a Pod, what does ArgoCD actually do, why would anyone want a secret manager with leases. This confusion is generic. It has nothing to do with us. The answer is the same whether you&rsquo;re here or anywhere else.</p>
<p>The second is <strong>&ldquo;how do <em>we</em> use it?&rdquo;</strong> — where our ArgoCD lives, how our customer tokens are minted, which Grafana panel goes red first when a backup stalls. This confusion is entirely local. No textbook will ever answer it, because the answer is our repo and our decisions.</p>
<p>A single linear document forces these two into one sequence, and they fight. Explain Kubernetes from scratch and the engineer who already knows it skims and misses the system-specific bit buried in paragraph six. Skip the basics and the engineer who <em>doesn&rsquo;t</em> know it is lost before they reach anything useful. You can&rsquo;t order one list to serve both readers. So I stopped trying.</p>
<h2 id="track-1-is-the-textbook-track-2-is-the-house">Track 1 is the textbook. Track 2 is the house.</h2>
<p>The fix was to split the knowledge base along that exact seam.</p>
<p><strong>Track 1 — Technical Foundation.</strong> Ten pages of generic DevOps: Linux, containers, Kubernetes concepts, Helm, GitOps &amp; ArgoCD, GitLab CI/CD, Vault, Argo Events, observability, Terraform. Every page is something you could, in principle, read on any platform team on earth. Assumed background is stated up front — comfortable with Linux and shell, no Kubernetes required — so nobody has to guess whether a page is for them.</p>
<p><strong>Track 2 — Our System.</strong> A dozen-plus pages of nothing <em>but</em> us: the cluster and its app-of-apps, the deploy pipelines, the customer model, the monitoring and backup agent, our Vault layout and token expiry monitoring, SSO, the approval portal, the full new-customer install. Every page assumes you already understand the underlying tech — and if you don&rsquo;t, it links straight back to its Track 1 counterpart.</p>
<p>That&rsquo;s the rule that keeps the split honest: each Track 1 page ends with an &ldquo;in our system&rdquo; link down to its implementation, and each Track 2 page names its Track 1 prerequisite at the top. Concept and implementation are separate documents, permanently wired to each other.</p>
<p>The win is that both tracks stand alone. A senior who&rsquo;s done Kubernetes for years skips Track 1 entirely and reads Track 2 like a system design doc. A strong sysadmin with zero cloud-native experience leans hard on Track 1 first. Same knowledge base, two honest reading paths, neither one padded for the other reader.</p>
<h2 id="the-interleave-is-the-whole-trick">The interleave is the whole trick</h2>
<p>Two tracks on their own would just be two piles. The thing that makes them a <em>roadmap</em> is the order you walk them in — and the order is a zipper, not two straight lines.</p>
<pre tabindex="0"><code>Track 1: Technical Foundation        Track 2: Our System
───────────────────────────────      ──────────────────────────────────
K8s concepts          → then →       K8s in our cluster
ArgoCD concepts       → then →       our ArgoCD + GitOps flow
Vault concepts        → then →       Vault here, customer tokens
Observability theory  → then →       our Grafana dashboards, alert types
</code></pre><p>Learn the concept cold, then immediately see it wearing our clothes. The generic mental model gets nailed down by a concrete, real, in-production example before it has time to evaporate — which is the difference between &ldquo;I read about ArgoCD once&rdquo; and &ldquo;I know where our ArgoCD is and what drift looks like on it.&rdquo; Read-then-do, not read-then-read.</p>
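<p>As a small illustration, the zipper order can be generated from the two lists. The page titles come from the diagram above; the function name <code>reading_order</code> is hypothetical.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from itertools import chain

# Page titles are taken from the diagram above.
TRACK_1 = ["K8s concepts", "ArgoCD concepts", "Vault concepts", "Observability theory"]
TRACK_2 = [
    "K8s in our cluster",
    "our ArgoCD + GitOps flow",
    "Vault here, customer tokens",
    "our Grafana dashboards, alert types",
]


def reading_order(concepts, implementations):
    # Alternate concept and implementation pages, pair by pair.
    return list(chain.from_iterable(zip(concepts, implementations)))
</code></pre>
<p>Calling <code>reading_order(TRACK_1, TRACK_2)</code> yields each concept page immediately followed by its in-house counterpart.</p>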
<h2 id="four-phases-because-learn-devops-isnt-a-task">Four phases, because &ldquo;learn DevOps&rdquo; isn&rsquo;t a task</h2>
<p>A pile of pages still isn&rsquo;t a plan, so the roadmap sits on top of both tracks and spends them over twenty weeks, in four phases, each with one blunt milestone:</p>
<table>
	<thead>
			<tr>
					<th>Phase</th>
					<th>Weeks</th>
					<th>Milestone</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Foundations</td>
					<td>1–3</td>
					<td>Can describe every component and monitor alerts</td>
			</tr>
			<tr>
					<td>Operations</td>
					<td>4–8</td>
					<td>Can deploy a customer stack and restore a backup solo</td>
			</tr>
			<tr>
					<td>Ownership</td>
					<td>9–14</td>
					<td>Can install a new customer from scratch</td>
			</tr>
			<tr>
					<td>Mastery</td>
					<td>15–20</td>
					<td>Can train someone else</td>
			</tr>
	</tbody>
</table>
<p>The milestones are deliberately verbs, not reading counts. Nobody is &ldquo;done with Phase 2&rdquo; because they finished the pages. They&rsquo;re done when they&rsquo;ve restored a backup without me in the room. The last milestone is the one that matters most to me personally — <em>can train someone else</em> — because that&rsquo;s the only state in which I&rsquo;m allowed to be hit by a bus.</p>
<h2 id="the-readiness-tracker-or-vibes-dont-scale">The readiness tracker, or: vibes don&rsquo;t scale</h2>
<p>Here&rsquo;s the part I&rsquo;m most attached to, because it&rsquo;s the part that fixes the original problem. &ldquo;Are you ready to own this?&rdquo; answered by gut feel is exactly the tribal-knowledge trap I was trying to escape, just relocated into the new hire&rsquo;s head.</p>
<p>So full ownership is broken into eight weighted domains, and at the end of every phase you score yourself against them, honestly, and then study your <em>lowest</em> numbers, not your favorites. It turns &ldquo;do I know enough yet?&rdquo; from a vibe into a number with a gap next to it. The same instinct I&rsquo;d apply to a service I&rsquo;m monitoring, pointed at a person&rsquo;s readiness instead. You don&rsquo;t get to feel ready. You get to be measurably less unready at the end of every phase.</p>
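<p>The arithmetic behind such a tracker is small. The sketch below assumes a 0 to 5 self-assessment scale, and the eight domain names and weights are made up for illustration; the real ones are not listed in this post.</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Hypothetical domains and weights; only the count of eight comes from the post.
WEIGHTS = {
    "kubernetes": 3,
    "gitops": 3,
    "vault": 2,
    "monitoring": 2,
    "backups": 2,
    "ci_cd": 1,
    "sso": 1,
    "customer_install": 2,
}


def readiness(scores, max_score=5):
    # Weighted average of the self-assessed scores, as a fraction of the maximum.
    total = sum(WEIGHTS[domain] * scores[domain] for domain in WEIGHTS)
    return total / (max_score * sum(WEIGHTS.values()))


def study_next(scores, count=3):
    # The lowest-scoring domains come first; ties keep their original order.
    return sorted(scores, key=scores.get)[:count]
</code></pre>
<p>Comparing <code>readiness</code> between two phases gives the gap as a number, and <code>study_next</code> names the domains to spend the next phase on.</p>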
<h2 id="what-id-tell-the-next-me">What I&rsquo;d tell the next me</h2>
<p>The mistake I almost made was treating onboarding docs as a <em>description of the system</em>. They&rsquo;re not. A description is ordered by how the machine is built. Onboarding has to be ordered by how a human learns — and a human learning a platform is running two processes at once, the general and the specific, and you have to feed both without starving either.</p>
<p>Splitting the knowledge base in two felt like more work and more surface to maintain. It was the opposite. Now when the tech changes, I edit Track 1. When <em>we</em> change, I edit Track 2. The seam that makes it easy to read is the same seam that makes it easy to keep alive.</p>
<p>The handover I got was a person. The handover I&rsquo;m leaving is a map — and it&rsquo;s drawn so the next person can read it without me standing behind them. That was the entire goal. The fact that I can now point a brand-new hire at a URL instead of at my calendar is just the proof it worked.</p>
]]></content:encoded></item></channel></rss>