Devops on hippotion

📝 Dev Notes

Sun, 21 Jun 2026 00:00:00 +0000

Kubernetes: init container crash loop leaves dirty emptyDir

When a pod’s init container crashes, Kubernetes restarts only the init container — not the whole pod. The emptyDir volume survives between retries. If your init container does a git clone into a fixed path, the second attempt fails with “destination path already exists.”

Fix: rm -rf the target dir before cloning.

rm -rf /git/repo
git clone --depth=10 --branch=main https://... /git/repo

After many restarts, no manual cleanup is needed: events expire in about an hour, and the Deployment controller replaces old pods automatically. Check recovery with:

kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -10

A “CPU spike” that was actually memory thrashing (adding GA4 to Hugo)

Wanted Google Analytics on this blog. PaperMod already calls a google_analytics.html partial in head.html, but it’s gated behind hugo.IsProduction | or (eq site.Params.env "production"). My blog pod runs hugo server, which defaults to the development environment, so the partial never fires. I “fixed” that by setting env = "production".

That was the wrong lever. env = production flips on Hugo’s whole production path — minification, OpenGraph, Twitter cards, schema JSON across every page. The next full rebuild blew past the pod’s 128Mi memory limit and got OOMKilled (exit 137). Server load jumped.

The right way to add GA without touching the build mode: drop the tag in layouts/_partials/extend_head.html. PaperMod includes that partial unconditionally, above the production guard — so it loads under hugo server too.

But here’s the part that fooled me. After reverting env, load was still climbing — to ~14 on a single node — and ps showed hugo at “500% CPU”. Looked like a runaway compute loop. It wasn’t:

%Cpu(s): 2.1 us, 41.0 sy, 6.9 id, 50.0 wa     <- 50% iowait, 2% userspace
PID ... S  %CPU  COMMAND
... D  333  hugo    <- state D, RES pinned at 127MiB (the 128Mi cgroup limit)

Two lessons:

ps %CPU is a lifetime average, not instantaneous. A process that ran hot for 1s then blocked still shows a big number for a while. Use top for what’s happening now.
High load + high %wa + a D-state process sitting at its cgroup memory limit = memory thrashing, not CPU. Hugo wasn’t computing — it was wedged against the 128Mi ceiling, and every allocation triggered kernel reclaim/swap. A sub-second build dragged out for minutes in uninterruptible I/O sleep, and all those blocked tasks are what inflate load average (Linux counts D-state in load).
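The first lesson is easier to trust once you do the arithmetic. ps divides total CPU time by wall-clock time since the process started, so a short multi-threaded burst followed by a stall still averages high. A minimal sketch with made-up numbers (the CPU-seconds and elapsed times below are illustrative, not measurements from this incident):

```shell
# Lifetime %CPU the way ps reports it: total CPU time / wall time since start.
# Arguments are hypothetical sample values.
lifetime_cpu_pct() {
  local cpu_seconds=$1 elapsed_seconds=$2
  awk -v c="$cpu_seconds" -v e="$elapsed_seconds" \
    'BEGIN { printf "%.0f\n", 100 * c / e }'
}

# 10 CPU-seconds burned across several threads, 2 s after start:
lifetime_cpu_pct 10 2     # prints 500

# Same process, still blocked 18 s later, no new CPU time used:
lifetime_cpu_pct 10 20    # prints 50
```

The number only decays as elapsed time grows, which is why it lags behind what the process is doing right now.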

The actual fix was boring: 128Mi was always marginal for hugo-extended + PaperMod. Bumped the limit to 512Mi and the thrash vanished.

Takeaway: when load spikes, read %wa and process state before blaming the CPU. And don’t flip env=production on a long-lived hugo server just to ungate one partial — use extend_head.html.

Self-hosting Supabase (lean) on k3s: the gotcha checklist

Ran the community supabase/supabase chart on a 16Gi single node — enabled db, rest, auth, meta, studio, kong + the log pipeline (analytics/Logflare + vector); left realtime, storage, imgproxy, edge-functions off. The deploy is easy; these are the things that actually bit:

Studio shows “no tables”. Supabase is single-database by design — Studio, PostgREST and auth all use the database named postgres. App tables in a separate database are invisible to all of it. Put your schema in postgres’s public schema.
Studio won’t schedule with edge-functions disabled. Its Deployment mounts the functions PVC unconditionally. Either run functions, or create the PVC yourself and leave functions off.
edge-functions crashloops if you keep it: it boots by fetching a Deno module from the internet, which a deny-all egress policy blocks. You usually only want the PVC it leaves behind anyway.
vector (log collector) stays silent under a deny-all policy. It discovers pods via the Kubernetes API, so it needs API egress, not just app ports (allowEgressToKubeApi). A log shipper that can’t reach the API collects nothing and doesn’t say why.
secretRef must contain every key the chart maps — including non-secret ones like database and openAiApiKey. Miss one and pods sit in CreateContainerConfigError.
ESO ExternalSecret shows perpetual OutOfSync in Argo CD unless you spell out the remoteRef defaults (conversionStrategy: Default, decodingStrategy: None, metadataPolicy: None) — ESO writes them back, and the compact form drifts.
postgres is not a superuser. CREATE DATABASE … OWNER app fails with must be member of role. Supabase keeps the real superuser (supabase_admin) to itself; GRANT app TO postgres first.
Logflare needs no BigQuery. It runs on the self-hosted Postgres backend (the _supabase database, _analytics schema) — logs land in _analytics.log_events_*.
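The secretRef item is the easiest one to check mechanically. A small sketch that compares the keys a chart expects against the keys a Secret actually holds. The key list and the Secret name in the usage comment are placeholders, not the chart's real mapping; take the real list from the values file for your chart version.

```shell
# missing_keys REQUIRED PRESENT: print every required key absent from PRESENT.
# Both arguments are whitespace-separated key names.
missing_keys() {
  local required=$1 present key
  present=$(printf '%s' "$2" | tr -s '[:space:]' ' ')
  for key in $required; do
    case " $present " in
      *" $key "*) ;;                    # key is there
      *) printf '%s\n' "$key" ;;        # key is missing
    esac
  done
}

# Hypothetical usage ("supabase-db" and the key names are assumptions):
# present=$(kubectl get secret supabase-db \
#   -o go-template='{{range $k, $v := .data}}{{$k}} {{end}}')
# missing_keys "username password database" "$present"
```

An empty result means every mapped key exists. Anything it prints is a candidate for the CreateContainerConfigError.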

None of this is in the README. It’s the gap between “I deployed Supabase” and “I run it.”

🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n

Fri, 13 Feb 2026 00:00:00 +0000

You don’t have to be about to change jobs to want to know the landscape. What’s being built, what it pays, where you’d actually fit — staying current on the market (and your own worth) is just good professional hygiene. The trouble is that checking is tedious, so most of us don’t, until we’re already job-hunting and starting cold.

So I automated mine. An n8n workflow on my homelab polls job boards every six hours, scores each new posting against my profile with an LLM, and emails me only the strong matches — the ones scoring 80%+. When it’s quiet, it’s silent. When something genuinely fits, I know the same day. Here’s what I learned building it. Repo at the bottom.

Three APIs cover most of the market

Company career pages look bespoke, but underneath, a large share run on one of three applicant tracking systems (ATS), and all three hand you the jobs as unauthenticated JSON:

Greenhouse — boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true
Lever — api.lever.co/v0/postings/{token}?mode=json
Ashby — api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true

No scraping, no headless browser. You poll the API the page itself calls, normalize the three shapes into one { company, title, location, remote, url, posted_at, description, external_id }, and you’re done with the hard part.

“Resolve the token” is half the battle

The naive assumption — the token is the company name, and everyone’s on one of the three — is half right. When I probed my initial wishlist, roughly half 404’d everywhere: HashiCorp (now under IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They’re on a fourth or fifth system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as disabled config stubs. Verify before you trust a slug.
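Probing a slug can be sketched in a few lines. The three URL shapes are the ones listed above. The https:// scheme, the 10-second timeout, and the "HTTP 200 means it resolves" rule are my assumptions for illustration, not a description of the actual workflow.

```shell
# ats_url ATS TOKEN: build the public job-board URL for one ATS.
ats_url() {
  local ats=$1 token=$2
  case "$ats" in
    greenhouse) printf 'https://boards-api.greenhouse.io/v1/boards/%s/jobs?content=true\n' "$token" ;;
    lever)      printf 'https://api.lever.co/v0/postings/%s?mode=json\n' "$token" ;;
    ashby)      printf 'https://api.ashbyhq.com/posting-api/job-board/%s?includeCompensation=true\n' "$token" ;;
    *)          return 1 ;;
  esac
}

# resolve_token TOKEN: print the first ATS that answers 200, fail if none does.
resolve_token() {
  local token=$1 ats code
  for ats in greenhouse lever ashby; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      "$(ats_url "$ats" "$token")") || code=000
    if [ "$code" = "200" ]; then
      printf '%s\n' "$ats"
      return 0
    fi
  done
  return 1
}
```

Running resolve_token over the whole wishlist once, before wiring anything into the workflow, is what separates the slugs that resolve from the ones that belong in disabled stubs.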

Dedup without a database

I didn’t want to stand up Postgres just to remember which jobs I’d already seen. n8n’s Data Tables handle it natively: a seen_jobs table, an external_id namespaced {ats}:{company}:{id}, and the rowNotExists operation drops anything already recorded. State lives inside n8n, backed up with it. Zero extra infrastructure.

The ordering matters: notify first, mark seen second. The insert only happens after the email sends, so a failed send retries next run instead of silently swallowing a posting.
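The same ordering, shown outside n8n. This sketch swaps the Data Table for a flat file (SEEN_FILE is a hypothetical stand-in, not what the workflow uses) so the notify-then-record sequence is visible on its own:

```shell
SEEN_FILE=${SEEN_FILE:-./seen_jobs.txt}   # stand-in for the seen_jobs table

# external_id ATS COMPANY ID: namespaced id in the {ats}:{company}:{id} shape.
external_id() { printf '%s:%s:%s\n' "$1" "$2" "$3"; }

# is_new ID: true when the id has not been recorded yet.
is_new() { ! grep -Fxq -- "$1" "$SEEN_FILE" 2>/dev/null; }

# process_job ID NOTIFY_CMD...: notify first, record only if the notify succeeded.
process_job() {
  local id=$1; shift
  is_new "$id" || return 0      # already seen: nothing to do
  "$@" "$id" || return 1        # send failed: stay unrecorded, retry next run
  printf '%s\n' "$id" >> "$SEEN_FILE"
}
```

Called as process_job "$(external_id greenhouse acme 42)" send_mail, where send_mail is whatever command delivers the message, a failed send leaves the file untouched and the posting comes back on the next poll.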

The location filter is a trap

My first version kept everything that wasn’t explicitly US-based. The inbox filled with “Senior Platform Engineer — Spain (Remote)” and "… — United Kingdom (Remote)". Those aren’t remote-for-me — they’re remote if you live in Spain. Useless from where I sit.

The fix was to invert the logic. Keep only three things:

globally-remote / worldwide / anywhere,
pan-EU (EMEA / Europe / EU / EEA),
my own country.

…and drop single-country remote, even EU ones. Region and home matches win over the country deny-list, and ambiguous locations are kept (a missed match is worse than one extra line to skim). That one change cut the noise more than anything else.
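The precedence is the part that is easy to get wrong, so here is the rule order as a small sketch. The deny-list entries and the home country passed in are hypothetical examples, and the real workflow keeps its lists in the Config node rather than in a script like this.

```shell
# Hypothetical deny-list of single-country markers, as one extended regex.
DENY_COUNTRIES='united states|united kingdom|spain|germany|india'

# keep_location LOCATION HOME_COUNTRY: exit 0 to keep the posting, 1 to drop it.
# HOME_COUNTRY must be non-empty.
keep_location() {
  local loc home
  loc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  home=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')

  # Region and home matches win, so they run before the deny-list.
  case "$loc" in
    *worldwide*|*anywhere*|*global*) return 0 ;;   # globally remote
    *emea*|*europe*|*eea*)           return 0 ;;   # pan-EU
    *"$home"*)                       return 0 ;;   # my own country
  esac
  # "eu" only as a standalone word, so it cannot match inside another word.
  if printf '%s' "$loc" | grep -Eq '(^|[^a-z])eu([^a-z]|$)'; then
    return 0
  fi

  # Single-country remote, even inside the EU, is dropped.
  if printf '%s' "$loc" | grep -Eq "$DENY_COUNTRIES"; then
    return 1
  fi

  return 0   # ambiguous: keep it rather than risk a missed match
}
```

With a home country of Portugal (an arbitrary example), "Spain (Remote)" is dropped, "Remote, EMEA" is kept, and a bare "Remote" is kept as ambiguous.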

Let an LLM read the actual job

Keyword + location filtering gets you a candidate list, but it can’t tell a “Platform Engineer” who herds Kubernetes from a “Platform Engineer” who owns a Figma design system. The job description can.

So the last step scores each new posting against my CV. My first version batched all of them into one big LLM call, which promptly timed out on the free tier. The fix was the opposite: one small call per job, which also means a single slow or rate-limited job never sinks the batch. Each call asks an NVIDIA NIM model (Llama 3.1 8B, OpenAI-compatible) for one number and a reason:

Score this job 0–100 for fit against my profile. Return {score, reason}.

That score is what lets me widen the net instead of narrowing it. On top of the curated company list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location filters do the first pass, then I only email the roles scoring 80%+. Casting wide is fine when a model is the bar at the door. A line ends up looking like:

92% — Grafana Labs — Senior Platform Engineer (Remote, EMEA) — strong k8s/GitOps overlap — link

Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.
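The gate itself is tiny. A sketch of the "skip on anything odd" behaviour, assuming the model replies with a JSON object that carries a numeric "score" field (the exact reply shape is my assumption, and a real JSON parser would be sturdier than this sed pattern):

```shell
THRESHOLD=${THRESHOLD:-80}

# passes_gate REPLY: exit 0 only when REPLY holds a numeric "score" at or
# above THRESHOLD. Unparseable replies (timeouts, empty bodies) fail closed.
passes_gate() {
  local score
  score=$(printf '%s' "$1" \
    | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
    | head -n 1)
  [ -n "$score" ] && [ "$score" -ge "$THRESHOLD" ]
}
```

A reply that cannot be parsed is treated the same as a low score: the job is skipped, never emailed, and never blocks the rest of the run.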

The unglamorous bits that make it trustworthy

One bad source can’t kill the run — every fetch is wrapped; failures become a ⚠️ N sources failing footer so a company quietly changing ATS is visible, not invisible.
A prime run seeds the table silently the first time, so I’m not buried under every currently-open role on day one.
Everything tunable lives in one Config node — companies, keywords, location lists, the profile, the model — so adding a company is a one-line edit, not a graph safari.

Takeaways

The “scrape job boards” problem mostly isn’t a scraping problem — it’s three public APIs and a normalizer.
For personal automation, reach for the boring-but-correct primitive: native dedup state beats a database you have to operate.
An LLM works best here as the bar at the door: cheap deterministic filters keep the candidate set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net without drowning in it.

Workflow JSON, the full node-by-node breakdown, and setup notes: github.com/janos-gyorgy/ats-job-poller.

📦 Five Ways to Manage Kubernetes Manifests (and Why They're Not All Equal)

Fri, 10 Oct 2025 00:00:00 +0000

The problem everyone hits

You’ve got a Kubernetes cluster. Now you need to describe what should run in it. You write some YAML, apply it, it works.

Then you need a second environment. Or a second service. Or someone else joins the project and asks “how do I add an app to this?” and you don’t have a good answer.

This is the manifest management problem, and there are five common solutions — ranging from “this works until it doesn’t” to “this is what production platforms actually look like.”

Approach 1: Raw manifests

The starting point for almost everyone. Write a YAML file, kubectl apply -f, done.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v1.2.3

Where it works: one service, one environment, learning Kubernetes. The feedback loop is immediate — write YAML, see what happens.

Where it breaks:

No templating. Want to change the image tag across ten services? Ten files, ten edits, ten chances to get it wrong.
Live state leaks in. If you export existing resources with kubectl get -o yaml, you get resourceVersion, generation, creationTimestamp, and managedFields in the output. Commit that to Git and you’ve created a permanent source of conflicts — ArgoCD compares what’s in Git against what’s in the cluster, sees stale version counters, and the diff never clears.
Copy-paste hell. A Deployment, a Service, an IngressRoute, a ServiceAccount, a NetworkPolicy — five files per app. Add a new app, copy five files, change the names, forget to update one. This is how environments drift apart silently.

The fix for the live-state problem is: only commit desired state. Strip every field that Kubernetes manages internally back to its clean spec. It’s tedious and easy to forget, which is exactly why people move on from raw manifests.
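Stripping by hand is where the mistakes happen, so a filter helps. This is a rough line-oriented sketch that drops the four metadata fields named above from a single exported object. It assumes kubectl's usual two-space indentation and does nothing about nested pod templates; a YAML-aware tool is the sturdier choice for anything beyond a quick cleanup.

```shell
# strip_live_state: read one manifest on stdin, write it back without
# resourceVersion, generation, creationTimestamp and managedFields.
strip_live_state() {
  awk '
    BEGIN { skip = -1 }
    # Inside a dropped block: keep dropping while lines are nested under it.
    skip >= 0 {
      rest = $0
      sub(/^ +/, "", rest)
      indent = length($0) - length(rest)
      if (rest == "" || indent > skip || (indent == skip && rest ~ /^- /)) next
      skip = -1
    }
    /^  (resourceVersion|generation|creationTimestamp):/ { next }   # one-line fields
    /^  managedFields:/ { skip = 2; next }                          # multi-line block
    { print }
  '
}

# Hypothetical usage:
# kubectl get deployment myapp -n myapp -o yaml | strip_live_state > deployment.yaml
```

Review the output before committing it. The filter only knows about the fields listed here.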

Approach 2: Kustomize

Kustomize is built into kubectl (kubectl apply -k) and natively supported by ArgoCD. The idea: you have a base/ with your raw manifests, and overlays that patch on top of them for different environments.

app/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml    # patches replicas to 1, image to :staging
    └── production/
        └── kustomization.yaml    # patches replicas to 3, image to :v1.2.3

# overlays/production/kustomization.yaml
resources:
  - ../../base
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment

Where it works: multi-environment setups where the difference between environments is mostly configuration values, not structure. Kustomize is good at this — you write the base once and patch only what differs.

Where it breaks:

No real parameterization. Kustomize patches are surgical edits, not templates. If your base structure needs to vary (different resource shapes per environment, conditional blocks), you’re fighting the tool.
Patching deep structures is ugly. JSON patches on nested YAML are verbose and hard to read. You end up writing more patch YAML than it would take to just copy the file.
Still repetitive across apps. Each app still gets its own base directory. You’re not abstracting the shared patterns across apps, only the differences between environments of the same app.

Kustomize is a significant step up from raw manifests for multi-environment setups. For complex templating or platform-level abstractions, it runs out of power quickly.

Approach 3: Helm

Helm adds real templating. Charts are parameterized bundles — templates with variables, conditionals, and loops — and values files supply the parameters.

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}
  namespace: {{ .Release.Namespace }}
spec:
  replicas: {{ .Values.replicas | default 1 }}
  # selector and pod labels must match, or the API server rejects the Deployment
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          {{- if .Values.resources }}
          resources: {{ .Values.resources | toYaml | nindent 12 }}
          {{- end }}

# values-production.yaml
name: myapp
replicas: 3
image:
  repository: myorg/myapp
  tag: v1.2.3

Helm renders the templates at deploy time. What lands in the cluster is clean rendered YAML — no internal state, no conflicts.

Where it works: almost everywhere. Artifact Hub (the successor to the old Helm Hub) has charts for most common software already. For custom apps, writing a chart once and parameterizing per-environment is straightforwardly better than copying YAML.

Where it breaks:

Chart authoring is verbose. Writing a Helm chart from scratch involves a lot of Go templating boilerplate. For a simple app, it can feel like more scaffolding than application.
Debugging rendered output is annoying. helm template is your friend, but errors in templates produce unhelpful messages. The indentation rules (nindent, indent, toYaml) have sharp edges.
Values files still pile up. If every app has its own values file and there’s no shared structure between them, you’re back to copy-paste but now in YAML-that-configures-YAML.

Helm is the right tool for most Kubernetes deployments. The ecosystem support alone (upstream charts for Postgres, Redis, Vault, and most CNCF projects) makes it the pragmatic default.

Approach 4: Jsonnet / CUE

For teams that need programmatic config generation — actual code, not templates — Jsonnet and CUE are the serious alternatives.

// deployment.jsonnet
local k = import "k.libsonnet";

local deployment(name, image, replicas=1) =
  k.apps.v1.deployment.new(name, replicas, [
    k.core.v1.container.new(name, image)
  ]);

{
  "deployment.yaml": deployment("myapp", "myorg/myapp:v1.2.3", replicas=3)
}

Where it works: large platforms where configuration is genuinely complex — many environments, many apps, deep interdependencies. Jsonnet lets you write real functions, share libraries, compose abstractions properly.

Where it breaks:

Steep learning curve. Jsonnet is a full language. CUE even more so — it has types, schemas, and a constraint system that takes time to internalise.
Small community. Excellent tooling, but you’re solving problems that have fewer Stack Overflow answers.
Overkill for most setups. If you’re not managing hundreds of services across multiple clusters, Helm is simpler and has everything you need.

Jsonnet is used seriously at Google-scale infrastructure teams and in some CNCF projects. For a homelab or a small-to-medium platform, it’s the right answer to a question you probably aren’t asking yet.

Approach 5: App-of-apps with generated Application CRDs

This is the ArgoCD-native meta-layer. Instead of managing manifests, you manage Application resources — and potentially use a chart or tool to generate those too.

A naive version: commit a folder of Application YAML files to Git, one per service. ArgoCD watches the folder and deploys each app.

A more sophisticated version: one “root app” that points to a chart, which generates all the other Application resources dynamically from a single config file.

Where it works: at the platform level, not the individual app level. App-of-apps is how you manage what ArgoCD manages, not how you write the service manifests themselves. Combined with Helm, it gives you centralized control over the entire cluster’s structure.

Where it breaks:

Manual Application CRDs are painful. If you’re maintaining a folder of hand-written Application YAML files — one per service — you’ve traded manifest copy-paste for Application copy-paste. Each app needs its own CRD with its repo URL, path, sync policy, project reference.
Sync ordering matters. The root app must exist before children can sync. Get the wave ordering wrong and apps try to deploy before their namespaces exist.

How this homelab compares

My setup sits at the far end of approach 5, using Helm throughout.

There’s a single applications.yml file that describes every service in the cluster. A root Helm chart reads it and generates all the ArgoCD Application and AppProject CRDs automatically. Adding a service means adding an entry to that file — not touching five different places across five different files.

# applications.yml — this is the entire service catalog
- namespace: web-vaultwarden
  networkPolicies:
    profile: web-app
  applications:
    - applicationCode: web-vaultwarden
      path: helm-charts/extra-objects
      autoSync: true

That one entry generates: a Namespace, an ArgoCD AppProject, an ArgoCD Application, a set of Cilium NetworkPolicies (deny-all with ingress from Traefik and DNS/HTTPS egress), and a ServiceAccount. Nothing is written by hand.
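To make "generated, not hand-written" concrete, here is the Application part of that expansion as a shell sketch. The real generator is the root Helm chart, so this is only a stand-in. The repo URL, the argocd namespace, the project naming, and the values-file convention are assumptions for illustration.

```shell
# Hypothetical repo URL; replace with the real GitOps repository.
REPO_URL=${REPO_URL:-https://gitlab.example.com/homelab/cluster.git}

# render_application NAMESPACE APP_CODE PATH AUTOSYNC: print one Argo CD Application.
render_application() {
  local namespace=$1 code=$2 path=$3 autosync=$4
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: $code
  namespace: argocd
spec:
  project: $namespace
  source:
    repoURL: $REPO_URL
    path: $path
    targetRevision: HEAD
    helm:
      valueFiles:
        - values-$code.yml
  destination:
    server: https://kubernetes.default.svc
    namespace: $namespace
EOF
  # autoSync: true in the catalog maps to an automated sync policy here.
  if [ "$autosync" = "true" ]; then
    printf '  syncPolicy:\n    automated: {}\n'
  fi
}

# render_application web-vaultwarden web-vaultwarden helm-charts/extra-objects true
```

Every catalog entry goes through the same function, so no two Applications can drift apart in shape. That is the whole point of generating them.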

The actual service manifests live in an extra-objects chart — a thin wrapper that renders raw YAML from values files. No templating in the service manifests themselves (they’re simple enough not to need it), but the infrastructure scaffolding around each app is entirely generated.

The result: every service gets the same operational properties. Same GitOps workflow, same secret management, same network isolation, same TLS termination. The platform work was done once. Adding a new app is writing manifests for the app’s specific behavior, not recreating the scaffolding.

The honest spectrum

Approach	Templating	Abstraction	Ecosystem	Complexity
Raw manifests	None	None	None	Low
Kustomize	Patches only	Overlays	Medium	Low-medium
Helm	Full	Per-chart	Large	Medium
Jsonnet/CUE	Full + typed	Libraries	Small	High
App-of-apps	Depends	Platform-level	ArgoCD-native	High

Most setups should start at Helm. Kustomize if you’re multi-environment and comfortable with patching. App-of-apps when you’re managing the platform layer, not individual services. Jsonnet/CUE when you know you’ve outgrown Helm — which is a specific and relatively rare problem to have.

Raw manifests are fine for learning. They’re the wrong answer for anything you intend to maintain.

More on how the homelab is structured: My Homelab Runs on GitOps.

🚨 Don't Restart the Node. Quarantine It First.

Fri, 01 Aug 2025 00:00:00 +0000

The reflex

Something’s wrong. A GitLab runner stops picking up jobs. An event processor starts dropping messages. A pod restarts in a loop. The node looks healthy — CPU fine, memory fine — but something is clearly off.

The reflex: restart the node, see if it clears.

Sometimes it does clear, and you move on. But you didn’t fix anything. You reset the state and crossed your fingers. If it happens again in two weeks, you’ll do the same thing. After enough iterations you have a “flaky node” that everyone reboots periodically and nobody understands.

There’s a better sequence. It takes twenty minutes instead of two, and you come out with either a real fix or actual knowledge of what happened.

Step one: quarantine, don’t kill

Before you touch anything, take the node out of rotation without destroying its current state.

kubectl cordon <node-name>

Cordon marks the node as unschedulable. No new pods land on it. Existing pods keep running. If you need the workloads somewhere else immediately:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Now you’ve removed the node from production traffic without rebooting. The node is still alive. Everything that happened on it is still there: logs, open files, kernel ring buffer, running processes, memory state.

This is the difference. A reboot wipes that. A cordon preserves it.

Step two: look at what’s actually there

SSH in. Don’t grep for anything specific yet — do a pass for anything unusual.

Kernel messages first. The kernel will often tell you exactly what went wrong before any application did.

dmesg -T --level=err,warn | tail -50

OOM kills show up here. Disk errors show up here. CPU soft lockups show up here. If you’ve got any of those, you have your answer before you’ve even looked at application logs.

Check for filesystem problems.

df -h          # is anything full?
dmesg | grep -i "ext4\|xfs\|btrfs\|i/o error\|ata"

A filesystem at 100% is silent until it isn’t. A flaky drive starts dropping I/O errors into dmesg long before SMART reports anything. Application developers rarely think about this case — their app just starts writing logs that say “failed to write” without specifying that the disk is full or dying.

System resource pressure.

vmstat 1 5          # is there swap activity?
iostat -x 1 5       # is a disk saturated?
cat /proc/pressure/io   # kernel PSI — pressure stall info

PSI is underused. It tells you whether processes were actually stalled waiting for I/O, not just whether throughput was high. A disk at 80% utilisation might be fine; a node where PSI shows tasks stalled on I/O 40% of the time is actively hurting performance.
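The file has two lines, "some" (at least one task stalled) and "full" (all non-idle tasks stalled), each with avg10, avg60 and avg300 percentages. A small sketch for pulling out one number so it can be compared or alerted on; the optional file argument exists only so the function can be pointed at a saved copy.

```shell
# psi_avg10 KIND [FILE]: print the avg10 value from the "some" or "full" line.
psi_avg10() {
  local kind=$1 file=${2:-/proc/pressure/io}
  awk -v kind="$kind" '$1 == kind { split($2, kv, "="); print kv[2] }' "$file"
}

# psi_avg10 some    # share of the last 10 s in which some task waited on I/O
```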

What were the pods doing right before things went sideways?

kubectl describe node <node-name>     # events section at the bottom
kubectl get events -A --field-selector involvedObject.kind=Pod --sort-by=.lastTimestamp

Look for OOMKilled exits, failed liveness probes, and throttling events. Kubernetes events expire after an hour by default — another reason not to reboot immediately; those events are still there if you look now.

A real example: the GitLab runner

A GitLab runner pod stops picking up jobs. It looks alive — the process is running, no crashes in the pod logs. Jobs sit in the queue.

Restart reflex: delete the pod, let it reschedule, it picks up jobs again.

But why did it stop?

journalctl -u gitlab-runner --since "1 hour ago"
# or, if it's a container:
kubectl logs <pod-name> --previous

In one instance: the runner’s working directory was on a tmpfs that hit its size limit. The runner silently failed to create job workspaces and stopped accepting new jobs. The error was one line in the pod logs: mkdir /builds: no space left on device. The pod was healthy by every other metric.

Fix: bump the tmpfs size limit in the runner config. The restart would have cleared tmpfs temporarily, and the runner would have failed again the next time a large job filled it up.

The debug took five minutes. The permanent fix took two minutes. Without quarantining first, the evidence would have been gone.

Another one: the event consumer

An event processor starts falling behind. Messages queue up. The pod shows no errors. Memory looks fine.

This one was subtler: the processor was connected to a downstream dependency over a persistent TCP connection. The connection was half-closed. The remote end had already dropped it, while the processor still thought it was alive. New messages were being sent into a dead socket and silently discarded.

ss -tnp | grep <downstream-host>     # look at the socket state

CLOSE_WAIT on a connection that should be ESTABLISHED. The application wasn’t checking whether the connection was actually working before using it, just whether it existed.
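When there are many connections, a per-state tally makes the odd one out easier to spot. A small sketch that reads ss output on stdin; note that ss spells the states CLOSE-WAIT and ESTAB, where netstat writes CLOSE_WAIT and ESTABLISHED.

```shell
# count_socket_states: read `ss -tan` output on stdin, print "STATE COUNT" per state.
count_socket_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# ss -tan | count_socket_states
```

A CLOSE-WAIT count that keeps growing points at an application that never closes sockets the peer has already shut.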

Restart would have cleared the socket state, fixed the symptom, and left the bug in the code.

What to look for — a short checklist

When a node is misbehaving, in order:

dmesg -T --level=err,warn — kernel errors, OOM kills, disk errors
df -h && df -i — full filesystems (space and inodes separately)
kubectl describe node — pressure conditions, recent events
kubectl logs --previous — what the pod logged before it died or got stuck
ss -tnp — socket states for network-adjacent issues
vmstat 1 5 + iostat -x 1 5 — resource pressure
journalctl -p err -b — system journal errors since last boot

Most problems show up in the first three.
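The node-local items on that list can be captured in one go, so the evidence survives even if a reboot follows. A sketch under a few assumptions: the output directory name is arbitrary, the function is meant to run on the node as root, and the two kubectl items are left out because they run from wherever your kubeconfig lives.

```shell
# capture NAME CMD...: run CMD, save stdout and stderr to $EVIDENCE_DIR/NAME.txt.
# A missing tool or a failing command is recorded in the file, never fatal.
capture() {
  local name=$1; shift
  "$@" > "$EVIDENCE_DIR/$name.txt" 2>&1 \
    || printf 'command failed: exit %s\n' "$?" >> "$EVIDENCE_DIR/$name.txt"
}

# collect_evidence [DIR]: snapshot the node-local checklist items into DIR.
collect_evidence() {
  EVIDENCE_DIR=${1:-./evidence-$(date +%Y%m%dT%H%M%S)}
  mkdir -p "$EVIDENCE_DIR" || return 1
  capture dmesg     dmesg -T --level=err,warn
  capture df-space  df -h
  capture df-inodes df -i
  capture sockets   ss -tnp
  capture vmstat    vmstat 1 5
  capture iostat    iostat -x 1 5
  capture journal   journalctl -p err -b
  printf '%s\n' "$EVIDENCE_DIR"
}
```

It takes roughly ten seconds because of the two sampling commands, and it prints the directory it wrote to so the path can go straight into a ticket or a commit message.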

After you’ve found something (or not found something)

If you found the cause: fix it, test it, uncordon the node.

kubectl uncordon <node-name>

Document what you found — a comment in the relevant config, a commit message, a note. “Fixed runner tmpfs limit” in the commit history is more useful than “flaky runner, restarted.”

If you genuinely found nothing: that’s information too. Cordon, reboot, uncordon, and note that the node rebooted clean with no identified cause. If it happens again, you have a pattern. Check whether anything changed in the workloads around that time. Check whether the reboot timing correlates with anything — cron jobs, backups, maintenance windows.

A reboot you can explain is a fix. A reboot you can’t explain is a time bomb.

Why this matters on a single-node cluster

In a multi-node setup you can afford to be lazier — cordon, drain, reboot, let the scheduler handle it, look at it later. On a single node there’s no “later.” The node coming back is all you’ve got.

But the habit is worth building regardless of node count. The engineers who understand their systems are the ones who looked before they rebooted.

The actual rule

Quarantine first. Debug second. Restart third (if you still need to).

A restart takes two minutes. The evidence it destroys might take two hours to reconstruct — or might be gone for good. The cordon costs you nothing.

🏗️ My Homelab Runs on GitOps. Here's What That Actually Means.

Fri, 28 Mar 2025 00:00:00 +0000

Why this exists

I’ve been working in DevOps and platform engineering long enough to know what I don’t know. The patterns that separate robust infrastructure from “it works on my machine” infrastructure — GitOps, admission policies, network segmentation, secrets management — are easy to read about. They’re harder to actually internalise without running them yourself.

So I built a homelab. An old ThinkCentre I had sitting around, k3s, and a rule I set for myself before writing a single line of configuration: GitLab is the only source of truth. No manual kubectl after bootstrap. All changes go through git push.

That rule turned out to be more consequential than I expected.

The stack

The cluster runs about thirty services across two categories: infrastructure that makes the platform work, and applications that actually do things.

Infrastructure:

k3s — lightweight Kubernetes, single-node
Cilium: CNI with built-in NetworkPolicy enforcement (Flannel, k3s’s default, does not enforce NetworkPolicies itself; k3s relies on a separate embedded controller for that)
Argo CD — GitOps reconciler, watches the repo, applies changes
Traefik — ingress controller, two entrypoints
Cloudflare tunnel — external access without open ports
cert-manager — wildcard TLS cert via Let’s Encrypt DNS-01
oauth2-proxy — GitLab SSO protecting everything by default
Vault + External Secrets Operator — secrets management
Pi-hole — local DNS for *.hippotion.com

Applications: a media server (Jellyfin, *arr stack), Immich for photos, Vaultwarden for passwords, Home Assistant, n8n for automation, a Hugo blog, Obsidian via browser-based KasmVNC, and a few custom-built things I’ll get to below.

Traffic reaches the cluster in two ways

External traffic (from anywhere on the internet) goes through a Cloudflare tunnel. The cloudflared pod dials out to Cloudflare — no open ports on the server, no firewall rules, no exposed IP. Cloudflare terminates TLS and forwards plain HTTP to Traefik on port 7080. Cloudflare handles the certificate for external visitors.

Local traffic (home WiFi) goes through Pi-hole, which resolves *.hippotion.com to the server’s LAN IP. Traefik receives HTTPS on port 443, served with a wildcard certificate that cert-manager issues from Let’s Encrypt via DNS-01 challenge. Port 80 redirects to 443; the cloudflare entrypoint on 7080 does not redirect, because it’s already receiving plain HTTP from cloudflared.

The result: the same IngressRoute handles both paths.

spec:
  entryPoints:
    - cloudflare   # plain HTTP from the cloudflared pod
    - websecure    # local HTTPS with wildcard cert
  routes:
    - match: Host(`myapp.hippotion.com`)
      kind: Rule
      middlewares:
        - name: oauth-auth
          namespace: sys-oauth2-gitlab
      services:
        - name: myapp
          port: 8080

Every IngressRoute has both entrypoints. If you forget one, the service is unreachable from half your access paths. Learned that the first time I added an app and couldn’t reach it from the phone.
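That mistake is cheap to catch before it reaches the cluster. A rough lint, assuming one IngressRoute per file and the two entrypoint names used above. It is a plain text match rather than a YAML parse, so treat it as a tripwire, not a validator.

```shell
# Entry points every IngressRoute is expected to list.
REQUIRED_ENTRYPOINTS="cloudflare websecure"

# missing_entrypoints FILE: print each required entrypoint that FILE does not list.
missing_entrypoints() {
  local file=$1 ep
  for ep in $REQUIRED_ENTRYPOINTS; do
    grep -Eq "^[[:space:]]*-[[:space:]]+${ep}([[:space:]]|#|\$)" "$file" \
      || printf '%s\n' "$ep"
  done
}
```

Empty output means both paths are covered. Anything printed is the access path that would have been unreachable.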

One file generates everything

The centrepiece of the setup is applications.yml — a single file that is the complete list of everything running in the cluster. Every entry generates a Namespace, an Argo CD AppProject, an Application, NetworkPolicies, and RBAC. Nothing is created anywhere else.

An entry looks like this:

- namespace: web-vaultwarden
  networkPolicies:
    profile: web-app
  applications:
    - applicationCode: web-vaultwarden
      path: helm-charts/extra-objects
      autoSync: true

Seven lines. That deploys a namespace, an Argo CD app that watches helm-charts/extra-objects/values-web-vaultwarden.yml, a full set of Cilium NetworkPolicies based on the web-app profile (deny-all with ingress from Traefik and egress to external), and a ServiceAccount. Adding a new service to the cluster is this file plus a values file with the actual Kubernetes manifests.

The profile: web-app notation deserves a word. Raw NetworkPolicy YAML is repetitive and error-prone — every namespace needs a deny-all base plus specific allows. I template it. A Helm chart maps profile names to concrete policy sets. web-app means: deny all ingress except from the ingress namespace, deny all egress except DNS and external HTTPS. web-app-internal means the same but no external egress — suitable for services that only talk to other in-cluster services. media-server adds port 6881 for BitTorrent. The policies are generated; no one writes them by hand.

Secrets without storing them in Git

Kubernetes Secret objects are not secrets. They’re base64-encoded blobs in etcd, and base64 is not encryption. Committing them to a Git repo — even a private one — is the wrong answer.

The setup here uses HashiCorp Vault as the actual secret store, with External Secrets Operator syncing Vault paths to Kubernetes Secrets. What lives in Git is an ExternalSecret CRD:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-credentials
  namespace: myapp
spec:
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: myapp-credentials
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: secret/myapp
        property: db-password

This is safe to commit. It says where the secret lives, not what it is. Vault contains the actual value. ESO syncs it to the cluster and refreshes it every hour, which is the operator’s default refreshInterval when the manifest doesn’t set one. Rotation means updating the value in Vault: no Git commit, no deployment.

Vault runs in-cluster with a sidecar that auto-unseals on restart. Not production-grade (the unseal key is on the same PVC as Vault itself), but pragmatic for a homelab where availability matters more than a sophisticated key management ceremony.

Three things I built that were worth building

Local AI inference

The cluster runs a local LLM. The web-ai-engine namespace has Open WebUI fronting a llama-server serving Phi-3.5 Mini in GGUF format. The model file lives on the node’s filesystem, mounted as a hostPath volume.

web-openclaw is a personal AI assistant UI that can route requests to either external providers (via NVIDIA’s API) or the local llama-server, depending on the task. The local model handles things that don’t need to leave the house; the external API handles things that do. The network policy for web-openclaw explicitly allows egress to web-ai-engine and nowhere else for local inference.

Running a 3.8B parameter model on homelab hardware is genuinely useful and costs nothing per query. It’s not GPT-4, but for summarisation, first drafts, and things you don’t want to send to a third-party API, it’s more than good enough.

Brew Buddy

I make kombucha. I was tracking fermentation batches in a notes app and getting annoyed at not being able to see history across batches. So I built a tracker.

Brew Buddy is a React frontend and a Go API backed by PostgreSQL, all running in the web-brew-buddy namespace. The images are built locally and imported into the cluster’s container runtime with k3s ctr images import. It’s deployed like any other app — a values file, an entry in applications.yml, a Vault secret for the database password.

The point isn’t the app. The point is that the platform handles a custom hobby project with the same operational properties as Vaultwarden or Immich. Same GitOps workflow, same secret management, same network isolation, same TLS termination. Adding an app to this cluster takes an afternoon of writing manifests and a few seconds of git push. The platform work was done once.

QR device login

This one has its own post because it took three days and four complete rewrites of oauth2-proxy’s session format to get right.

The short version: the Homer dashboard on the living room TV needed a way to log in without typing credentials on a TV keyboard. I built a device-flow OAuth service: the phone scans a QR code, the phone authenticates with GitLab, and a TV session is created. Ending the session from the phone kills the TV’s session immediately by deleting its oauth2-proxy Redis ticket.

It’s the most overengineered solution to a problem I have, and I don’t regret a minute of it.

What operating this way actually changes

The practical difference of the no-manual-kubectl rule is larger than it sounds.

The audit trail is automatic. Every change to the cluster is a git commit with an author, a timestamp, and a diff. There’s no “what did I change last Tuesday?” — I know exactly what changed last Tuesday, and I can revert it with git revert. The Argo CD UI shows the diff between what’s in Git and what’s running. If there’s a diff, something went wrong.

New services are cheap to add. The platform does the repetitive work — namespace, RBAC, network policies, TLS termination, OAuth protection. Adding a new app is writing the manifests and updating applications.yml. The infrastructure concerns are handled.

Recovery is straightforward. If I rebuild the node (which I’ve done), I run two bootstrap scripts, apply one Argo CD manifest, and the cluster reconciles itself from Git over the next few minutes. The only things that require manual work are the secrets that can’t live in Git — two OAuth credentials and the Cloudflare tunnel token, all recreated by scripts/create-secrets.sh.

Experimentation is safe. Apps that I’m not sure I’ll keep run with toggleable: true. Turning them off is removing the entry from applications.yml and pushing. Turning them back on is adding it back.

What it doesn’t solve

Bootstrap is manual. The first kubectl apply -f argocd/root-app.yaml happens outside of GitOps by definition. The three bootstrap secrets can’t be in Git. This is unavoidable — you need to trust something before GitOps can take over, and that something is a short manual procedure.

Some things fight the model. k3s’s built-in addon controller rewrites the metrics-server Deployment on every k3s restart, removing a patch needed for Cilium compatibility. The fix is a pod that watches for the revert and reapplies the patch. It works, but it’s a workaround for a component I don’t control.
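
The core of that watcher is an idempotent check, sketched here in Python. The post doesn’t spell out the actual patch, so both flags below are placeholders, and the function name reconcile_args is hypothetical; the real pod would read the Deployment from the API and write the result back.

```python
def reconcile_args(current: list, required: list) -> tuple:
    """Return (desired_args, changed): current args plus any missing required ones."""
    missing = [arg for arg in required if arg not in current]
    return current + missing, bool(missing)


# Placeholder flags; the real patch is not described in this post.
REQUIRED = ["--example-required-flag"]

# State after k3s has rewritten the Deployment and dropped the patch.
reverted = ["--example-existing-flag"]
patched, changed = reconcile_args(reverted, REQUIRED)

# Running it again on the patched state is a no-op, so polling is safe.
_, changed_again = reconcile_args(patched, REQUIRED)
print(changed, changed_again)
```

Because the second pass reports no change, the watcher can run on a timer without producing a rollout every time it wakes up.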

Single-node means single point of failure. For a homelab, that’s acceptable. For anything important, it’s not.

The honest summary

I set out to learn production-grade Kubernetes patterns, and I did. The GitOps constraint turned out to be the best engineering decision in the project — not because it made things easier in the short term (it didn’t), but because it forced every change through a path that is auditable, reversible, and consistent.

The cluster is a single ThinkCentre running about thirty services, secured by Cilium network policies, authenticated via GitLab SSO, with secrets managed by Vault and all configuration in a Git repo that I could hand to someone tomorrow and they’d understand what’s running and why.

That’s the goal. For a homelab, I’ll call it achieved.

I Inherited a System With No Map. So I Drew Two.

Fri, 28 Feb 2025 00:00:00 +0000

When I took over DevOps, the handover was a person, not a document. That person was leaving. Everything I’d need to keep thirty-odd services and a fleet of customer servers alive lived in his head, in scattered runbooks, and in the muscle memory of having done it before. The classic shape: the system worked, and exactly one human knew why.

So the first real project wasn’t a migration or a dashboard. It was writing down the system before the only other copy walked out the door.

The obvious move is to write the docs — one big knowledge base, ordered however the system happens to be wired. I tried that for about a day. It doesn’t work, and the reason it doesn’t work is the whole point of this post.

The two questions a new hire is actually asking

Watch someone learn an unfamiliar platform and you’ll notice they’re never confused about one thing. They’re confused about two, and they’re different kinds of confused.

The first is “what is this technology?” — what’s a Pod, what does ArgoCD actually do, why would anyone want a secret manager with leases. This confusion is generic. It has nothing to do with us. The answer is the same whether you’re here or anywhere else.

The second is “how do we use it?” — where our ArgoCD lives, how our customer tokens are minted, which Grafana panel goes red first when a backup stalls. This confusion is entirely local. No textbook will ever answer it, because the answer is our repo and our decisions.

A single linear document forces these two into one sequence, and they fight. Explain Kubernetes from scratch and the engineer who already knows it skims and misses the system-specific bit buried in paragraph six. Skip the basics and the engineer who doesn’t know it is lost before they reach anything useful. You can’t order one list to serve both readers. So I stopped trying.

Track 1 is the textbook. Track 2 is the house.

The fix was to split the knowledge base along that exact seam.

Track 1 — Technical Foundation. Ten pages of generic DevOps: Linux, containers, Kubernetes concepts, Helm, GitOps & ArgoCD, GitLab CI/CD, Vault, Argo Events, observability, Terraform. Every page is something you could, in principle, read on any platform team on earth. Assumed background is stated up front — comfortable with Linux and shell, no Kubernetes required — so nobody has to guess whether a page is for them.

Track 2 — Our System. A dozen-plus pages of nothing but us: the cluster and its app-of-apps, the deploy pipelines, the customer model, the monitoring and backup agent, our Vault layout and token expiry monitoring, SSO, the approval portal, the full new-customer install. Every page assumes you already understand the underlying tech — and if you don’t, it links straight back to its Track 1 counterpart.

That’s the rule that keeps the split honest: each Track 1 page ends with an “in our system” link down to its implementation, and each Track 2 page names its Track 1 prerequisite at the top. Concept and implementation are separate documents, permanently wired to each other.

The win is that both tracks stand alone. A senior who’s done Kubernetes for years skips Track 1 entirely and reads Track 2 like a system design doc. A strong sysadmin with zero cloud-native experience leans hard on Track 1 first. Same knowledge base, two honest reading paths, neither one padded for the other reader.

The interleave is the whole trick

Two tracks on their own would just be two piles. The thing that makes them a roadmap is the order you walk them in — and the order is a zipper, not two straight lines.

Track 1: Technical Foundation        Track 2: Our System
───────────────────────────────      ──────────────────────────────────
K8s concepts          → then →       K8s in our cluster
ArgoCD concepts       → then →       our ArgoCD + GitOps flow
Vault concepts        → then →       Vault here, customer tokens
Observability theory  → then →       our Grafana dashboards, alert types

Learn the concept cold, then immediately see it wearing our clothes. The generic mental model gets nailed down by a concrete, real, in-production example before it has time to evaporate — which is the difference between “I read about ArgoCD once” and “I know where our ArgoCD is and what drift looks like on it.” Read-then-do, not read-then-read.

Four phases, because “learn DevOps” isn’t a task

A pile of pages still isn’t a plan, so the roadmap sits on top of both tracks and spends them over twenty weeks, in four phases, each with one blunt milestone:

Phase         Weeks    Milestone
───────────   ──────   ─────────────────────────────────────────────────────
Foundations   1–3      Can describe every component and monitor alerts
Operations    4–8      Can deploy a customer stack and restore a backup solo
Ownership     9–14     Can install a new customer from scratch
Mastery       15–20    Can train someone else

The milestones are deliberately verbs, not reading counts. Nobody is “done with Phase 2” because they finished the pages. They’re done when they’ve restored a backup without me in the room. The last milestone is the one that matters most to me personally — can train someone else — because that’s the only state in which I’m allowed to be hit by a bus.

The readiness tracker, or: vibes don’t scale

Here’s the part I’m most attached to, because it’s the part that fixes the original problem. “Are you ready to own this?” answered by gut feel is exactly the tribal-knowledge trap I was trying to escape, just relocated into the new hire’s head.

So full ownership is broken into eight weighted domains, and at the end of every phase you score yourself against them, honestly, and then study your lowest numbers, not your favorites. It turns “do I know enough yet?” from a vibe into a number with a gap next to it. It’s the same instinct I’d apply to a service I’m monitoring, pointed at a person’s readiness instead. You don’t get to feel ready. You get to be measurably less unready at the end of every phase.
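
The arithmetic behind the tracker is small enough to sketch in Python. Everything specific here is assumed for illustration: the eight domain names, their weights, and the 0 to 5 scale are placeholders, not the real tracker.

```python
# Hypothetical domains and weights (percent); the real list is not shown here.
WEIGHTS = {
    "cluster": 20,
    "deploy pipelines": 15,
    "customer model": 15,
    "monitoring": 15,
    "backups": 10,
    "vault": 10,
    "sso": 5,
    "customer install": 10,
}
MAX_SCORE = 5  # assumed self-assessment scale of 0 to 5


def readiness(scores: dict) -> float:
    """Weighted readiness as a fraction between 0 and 1."""
    earned = sum(WEIGHTS[domain] * scores[domain] for domain in WEIGHTS)
    return earned / (sum(WEIGHTS.values()) * MAX_SCORE)


def study_order(scores: dict) -> list:
    """Domains with the lowest score first; heavier weight breaks ties."""
    return sorted(WEIGHTS, key=lambda d: (scores[d], -WEIGHTS[d]))


# Example self-assessment at the end of a phase (made-up numbers).
scores = {domain: 3 for domain in WEIGHTS}
scores["vault"] = 1
scores["backups"] = 2

print(round(readiness(scores), 2), study_order(scores)[:2])
```

The single number is the headline, but the sorted list is what drives the next phase: it names the domains to study first.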

What I’d tell the next me

The mistake I almost made was treating onboarding docs as a description of the system. They’re not. A description is ordered by how the machine is built. Onboarding has to be ordered by how a human learns — and a human learning a platform is running two processes at once, the general and the specific, and you have to feed both without starving either.

Splitting the knowledge base in two felt like more work and more surface to maintain. It was the opposite. Now when the tech changes, I edit Track 1. When we change, I edit Track 2. The seam that makes it easy to read is the same seam that makes it easy to keep alive.

The handover I got was a person. The handover I’m leaving is a map — and it’s drawn so the next person can read it without me standing behind them. That was the entire goal. The fact that I can now point a brand-new hire at a URL instead of at my calendar is just the proof it worked.
