Why this exists
I’ve been working in DevOps and platform engineering long enough to know what I don’t know. The patterns that separate robust infrastructure from “it works on my machine” infrastructure — GitOps, admission policies, network segmentation, secrets management — are easy to read about. They’re harder to actually internalise without running them yourself.
So I built a homelab. An old ThinkCentre I had sitting around, k3s, and a rule I set for myself before writing a single line of configuration: GitLab is the only source of truth. No manual kubectl after bootstrap. All changes go through git push.
That rule turned out to be more consequential than I expected.
The stack
The cluster runs about thirty services across two categories: infrastructure that makes the platform work, and applications that actually do things.
Infrastructure:
- k3s — lightweight Kubernetes, single-node
- Cilium — CNI with NetworkPolicy enforcement (Flannel, k3s’s default CNI, does not enforce NetworkPolicies itself)
- Argo CD — GitOps reconciler, watches the repo, applies changes
- Traefik — ingress controller, two entrypoints
- Cloudflare tunnel — external access without open ports
- cert-manager — wildcard TLS cert via Let’s Encrypt DNS-01
- oauth2-proxy — GitLab SSO protecting everything by default
- Vault + External Secrets Operator — secrets management
- Pi-hole — local DNS for *.hippotion.com
Applications: a media server (Jellyfin, *arr stack), Immich for photos, Vaultwarden for passwords, Home Assistant, n8n for automation, a Hugo blog, Obsidian via browser-based KasmVNC, and a few custom-built things I’ll get to below.
Traffic reaches the cluster in two ways
External traffic (from anywhere on the internet) goes through a Cloudflare tunnel. The cloudflared pod dials out to Cloudflare — no open ports on the server, no firewall rules, no exposed IP. Cloudflare terminates TLS and forwards plain HTTP to Traefik on port 7080. Cloudflare handles the certificate for external visitors.
Local traffic (home WiFi) goes through Pi-hole, which resolves *.hippotion.com to the server’s LAN IP. Traefik receives HTTPS on port 443, served with a wildcard certificate that cert-manager issues from Let’s Encrypt via DNS-01 challenge. Port 80 redirects to 443; the cloudflare entrypoint on 7080 does not redirect, because it’s already receiving plain HTTP from cloudflared.
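As a rough illustration, the three entrypoints described above could look like this in Traefik’s static configuration. This is a sketch rather than the cluster’s actual config: the entrypoint name web is an assumption (only cloudflare and websecure are named in this post), and a Helm-based install such as the Traefik bundled with k3s would normally express the same thing through chart values.

```yaml
# Traefik static configuration (sketch; names other than
# "cloudflare" and "websecure" are assumptions)
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure      # port 80 only exists to redirect to HTTPS
          scheme: https
  websecure:
    address: ":443"
    http:
      tls: {}                # serve TLS on this entrypoint (wildcard cert)
  cloudflare:
    address: ":7080"         # plain HTTP from cloudflared, deliberately no redirect
```

The design point is that the redirect is attached to a single entrypoint, not applied globally, so traffic that Cloudflare has already terminated is never bounced back to HTTPS.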
The result: the same IngressRoute handles both paths.
spec:
  entryPoints:
    - cloudflare   # plain HTTP from the cloudflared pod
    - websecure    # local HTTPS with wildcard cert
  routes:
    - match: Host(`myapp.hippotion.com`)
      kind: Rule
      middlewares:
        - name: oauth-auth
          namespace: sys-oauth2-gitlab
      services:
        - name: myapp
          port: 8080
Every IngressRoute has both entrypoints. If you forget one, the service is unreachable from half your access paths. Learned that the first time I added an app and couldn’t reach it from the phone.
One file generates everything
The centrepiece of the setup is applications.yml — a single file that is the complete list of everything running in the cluster. Every entry generates a Namespace, an Argo CD AppProject, an Application, NetworkPolicies, and RBAC. Nothing is created anywhere else.
An entry looks like this:
- namespace: web-vaultwarden
  networkPolicies:
    profile: web-app
  applications:
    - applicationCode: web-vaultwarden
      path: helm-charts/extra-objects
      autoSync: true
Seven lines. That deploys a namespace, an Argo CD app that watches helm-charts/extra-objects/values-web-vaultwarden.yml, a full set of Cilium NetworkPolicies based on the web-app profile (deny-all with ingress from Traefik and egress to external), and a ServiceAccount. Adding a new service to the cluster means one entry in this file plus a values file with the actual Kubernetes manifests.
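To make that concrete, the Argo CD Application generated for this entry might look roughly like the following. This is a hedged sketch: the repoURL, the argocd namespace, the project name, and the exact syncPolicy options are assumptions for illustration, not copied from the real chart output.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-vaultwarden
  namespace: argocd                  # assumed Argo CD install namespace
spec:
  project: web-vaultwarden           # assumed: one AppProject per entry
  source:
    repoURL: https://gitlab.example.com/homelab/cluster.git   # hypothetical repo URL
    targetRevision: HEAD
    path: helm-charts/extra-objects
    helm:
      valueFiles:
        - values-web-vaultwarden.yml
  destination:
    server: https://kubernetes.default.svc
    namespace: web-vaultwarden
  syncPolicy:
    automated:                       # assumed mapping of autoSync: true
      prune: true
      selfHeal: true
```

With selfHeal enabled, a manual change made with kubectl would be reverted on the next reconcile, which is one way the Git-only rule could be enforced instead of merely agreed on.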
The profile: web-app notation deserves a word. Raw NetworkPolicy YAML is repetitive and error-prone — every namespace needs a deny-all base plus specific allows. I template it. A Helm chart maps profile names to concrete policy sets. web-app means: deny all ingress except from the ingress namespace, deny all egress except DNS and external HTTPS. web-app-internal means the same but no external egress — suitable for services that only talk to other in-cluster services. media-server adds port 6881 for BitTorrent. The policies are generated; no one writes them by hand.
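For a sense of what the web-app profile could expand to, here is a minimal sketch using a standard Kubernetes NetworkPolicy, which Cilium enforces. The ingress namespace name sys-traefik, the DNS pod labels, and the private-range exclusions are assumptions; the real chart may use CiliumNetworkPolicy resources or different selectors.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app
  namespace: web-vaultwarden
spec:
  podSelector: {}                    # selects every pod in the namespace
  policyTypes:                       # anything not allowed below is denied
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sys-traefik   # hypothetical ingress namespace
  egress:
    - to:                            # DNS lookups via the cluster resolver
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:                            # external HTTPS only, private ranges excluded
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
```

Under these assumptions, a web-app-internal variant would be the same policy with the last egress rule removed.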
Secrets without storing them in Git
Kubernetes Secret objects are not secrets. They’re base64-encoded blobs in etcd, and base64 is not encryption. Committing them to a Git repo — even a private one — is the wrong answer.
The setup here uses HashiCorp Vault as the actual secret store, with External Secrets Operator syncing Vault paths to Kubernetes Secrets. What lives in Git is an ExternalSecret CRD:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-credentials
  namespace: myapp
spec:
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: myapp-credentials
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: secret/myapp
        property: db-password
This is safe to commit. It says where the secret lives, not what it is. Vault contains the actual value. ESO syncs it to the cluster and refreshes every hour. Rotation means updating the value in Vault — no Git commit, no deployment.
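The ExternalSecret above points at a ClusterSecretStore named vault. A minimal sketch of such a store is shown below, assuming a KV v2 engine mounted at secret and Kubernetes auth; the Vault address, role name, and service account are hypothetical placeholders, not values from this cluster.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault
spec:
  provider:
    vault:
      server: "http://vault.sys-vault.svc:8200"   # hypothetical in-cluster address
      path: "secret"                               # KV mount, matching key: secret/myapp
      version: "v2"                                # assumed KV engine version
      auth:
        kubernetes:                                # assumed auth method
          mountPath: "kubernetes"
          role: "external-secrets"                 # hypothetical Vault role
          serviceAccountRef:
            name: "external-secrets"               # hypothetical service account
            namespace: "sys-external-secrets"      # hypothetical namespace
```

Like the ExternalSecret, this contains no secret material, so it can live in Git alongside everything else.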
Vault runs in-cluster with a sidecar that auto-unseals on restart. Not production-grade (the unseal key is on the same PVC as Vault itself), but pragmatic for a homelab where availability matters more than a sophisticated key management ceremony.
Three things I built that were worth building
Local AI inference
The cluster runs a local LLM. The web-ai-engine namespace has Open WebUI fronting a llama-server serving Phi-3.5 Mini in GGUF format. The model file lives on the node’s filesystem, mounted as a hostPath volume.
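A hostPath-backed llama-server could be declared along these lines. Treat it as a sketch under stated assumptions: the image tag, the model file name, the host directory, and the port are all illustrative and not taken from the actual values file.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
  namespace: web-ai-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server   # assumed image
          args:
            - "--model"
            - "/models/phi-3.5-mini-instruct.gguf"   # hypothetical file name
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"                                 # assumed port
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true                         # the model file is never written to
      volumes:
        - name: models
          hostPath:
            path: /opt/models                        # hypothetical directory on the node
            type: Directory
```

A hostPath volume ties the pod to one machine, which is normally a drawback but costs nothing on a single-node cluster and avoids copying a multi-gigabyte model into a PVC.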
web-openclaw is a personal AI assistant UI that can route requests to either external providers (via NVIDIA’s API) or the local llama-server, depending on the task. The local model handles things that don’t need to leave the house; the external API handles things that do. For local inference, the network policy for web-openclaw explicitly allows egress to web-ai-engine and nowhere else.
Running a 3.8B-parameter model on homelab hardware is genuinely useful and costs nothing per query. It’s not GPT-4, but for summarisation, first drafts, and things you don’t want to send to a third-party API, it’s more than good enough.
Brew Buddy
I make kombucha. I was tracking fermentation batches in a notes app and getting annoyed at not being able to see history across batches. So I built a tracker.
Brew Buddy is a React frontend and a Go API backed by PostgreSQL, all running in the web-brew-buddy namespace. The images are built locally and imported into the cluster’s container runtime with k3s ctr images import. It’s deployed like any other app — a values file, an entry in applications.yml, a Vault secret for the database password.
The point isn’t the app. The point is that the platform handles a custom hobby project with the same operational properties as Vaultwarden or Immich. Same GitOps workflow, same secret management, same network isolation, same TLS termination. Adding an app to this cluster takes an afternoon of writing manifests and a few seconds of git push. The platform work was done once.
QR device login
This one has its own post because it took three days and four complete rewrites of oauth2-proxy’s session format to get right.
The short version: the Homer dashboard on the living room TV needed a way to log in without typing credentials on a TV keyboard. I built a device-flow OAuth service: the phone scans a QR code, the phone authenticates with GitLab, and a TV session is created. Ending the session from the phone kills the TV’s session immediately by deleting the oauth2-proxy Redis ticket.
It’s the most overengineered solution to a problem I have, and I don’t regret a minute of it.
What operating this way actually changes
The practical difference of the no-manual-kubectl rule is larger than it sounds.
The audit trail is automatic. Every change to the cluster is a git commit with an author, a timestamp, and a diff. There’s no “what did I change last Tuesday?” — I know exactly what changed last Tuesday, and I can revert it with git revert. The Argo CD UI shows the diff between what’s in Git and what’s running. If there’s a diff, something went wrong.
New services are cheap to add. The platform does the repetitive work — namespace, RBAC, network policies, TLS termination, OAuth protection. Adding a new app is writing the manifests and updating applications.yml. The infrastructure concerns are handled.
Recovery is straightforward. If I rebuild the node (which I’ve done), I run two bootstrap scripts, apply one Argo CD manifest, and the cluster reconciles itself from Git over the next few minutes. The only things that require manual work are the secrets that can’t live in Git — two OAuth credentials and the Cloudflare tunnel token, all recreated by scripts/create-secrets.sh.
Experimentation is safe. I mark apps that I’m not sure I’ll keep as toggleable: true. Turning one off means removing its entry from applications.yml and pushing; turning it back on means adding the entry back.
What it doesn’t solve
Bootstrap is manual. The first kubectl apply -f argocd/root-app.yaml happens outside of GitOps by definition. The three bootstrap secrets can’t be in Git. This is unavoidable — you need to trust something before GitOps can take over, and that something is a short manual procedure.
Some things fight the model. k3s’s built-in addon controller rewrites the metrics-server Deployment on every k3s restart, removing a patch needed for Cilium compatibility. The fix is a pod that watches for the revert and reapplies the patch. It works, but it’s a workaround for a component I don’t control.
Single-node means single point of failure. For a homelab, that’s acceptable. For anything important, it’s not.
The honest summary
I set out to learn production-grade Kubernetes patterns, and I did. The GitOps constraint turned out to be the best engineering decision in the project — not because it made things easier in the short term (it didn’t), but because it forced every change through a path that is auditable, reversible, and consistent.
The cluster is a single ThinkCentre running about thirty services, secured by Cilium network policies, authenticated via GitLab SSO, with secrets managed by Vault and all configuration in a Git repo that I could hand to someone tomorrow and they’d understand what’s running and why.
That’s the goal. For a homelab, I’ll call it achieved.
