Argocd on hippotion

I Run GitOps for My Brain

Fri, 01 May 2026 00:00:00 +0000

The pattern I didn’t know I had

This week an AI agent told me something about my own systems that I’d never noticed, and it was correct: I have one favorite architecture, and I’ve built it three times.

At work: git holds Terraform code → Terraform derives the S3 buckets. Nobody clicks around in the AWS console; the repo is the truth.
In the homelab: git holds Kubernetes manifests → ArgoCD derives the cluster. Every app on my rack is a folder in a repo.
In my second brain: a vault of markdown notes → an indexer derives the search database (SQLite FTS + a link graph) that my AI tools query.

Same shape everywhere: a plain-text source of truth in git, and a machine that builds the real thing from it. Master copy, derived state. I never decided this consciously — it’s just how my hands build things now.

GitOps isn’t the git part

Here’s what the third copy got wrong, and it took me embarrassingly long to see, given that I teach this pattern at the infrastructure layer.

“Configuration in git” existed long before GitOps. What made GitOps an actual shift was the reconciler: ArgoCD doesn’t apply your manifests once and wish you luck. It watches, continuously. When the cluster drifts from the repo, you get an OutOfSync badge, and with selfHeal enabled it puts reality back where the repo says it should be. The loop is the product. Git is just where the loop points.

My vault had no loop. If I edited a note and forgot to rebuild the index, the search results my AI agents rely on were silently stale — no badge, no error, nothing. The only protection was a rule in the repo’s agent instructions: “if files and index disagree, the files win — run the indexer.”

A policy that agents must remember. In other words: I was running Kubernetes with a sticky note on the monitor that says please redeploy after editing the YAML. I would never accept that on my cluster. My brain ran on it for months.

The fix took an afternoon

Two pieces, both boring on purpose.

exo status — the OutOfSync badge. The indexer now stores a content hash per note; status re-hashes the vault and diffs:

{
  "status": "OutOfSync",
  "modified": ["vault/10-notes/interests-themes.md"],
  "new": [],
  "deleted": [],
  "repair": "exo index"
}

Exit code 0 when synced, 1 when not, so scripts and CI can ask the question too, much like argocd app diff, which exits 1 when it finds a difference.
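The hash-and-diff idea behind the status command can be sketched in a few lines of Python. This is a minimal illustration and not the actual exo implementation: the function names, the choice of SHA-256, and the assumption that notes are *.md files are all mine.

```python
import hashlib
from pathlib import Path


def hash_vault(vault_dir: Path) -> dict[str, str]:
    """Map each markdown note (path relative to the vault) to its SHA-256 digest."""
    return {
        path.relative_to(vault_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(vault_dir.rglob("*.md"))
    }


def diff_status(indexed: dict[str, str], current: dict[str, str]) -> dict:
    """Compare the hashes stored at index time with the hashes on disk now."""
    modified = sorted(p for p in current if p in indexed and indexed[p] != current[p])
    new = sorted(p for p in current if p not in indexed)
    deleted = sorted(p for p in indexed if p not in current)
    synced = not (modified or new or deleted)
    report = {
        "status": "Synced" if synced else "OutOfSync",
        "modified": modified,
        "new": new,
        "deleted": deleted,
    }
    if not synced:
        report["repair"] = "exo index"
    return report


def exit_code(report: dict) -> int:
    """0 when synced, 1 when not, so shell scripts and CI can branch on it."""
    return 0 if report["status"] == "Synced" else 1
```

A status command would call hash_vault on the vault directory, load the hashes the indexer stored, and pass both to diff_status.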

Git hooks — the selfHeal. Versioned hooks (core.hooksPath .githooks) on post-commit and post-merge rebuild the index after every commit and pull:

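# Skip quietly on machines where exo is not installed
command -v exo >/dev/null 2>&1 || exit 0
# Export so the exo child process can see the vault root (assumes exo reads EXO_ROOT)
export EXO_ROOT="$(git rev-parse --show-toplevel)"
exo index >/dev/null 2>&1 && echo "exo: index reconciled (Synced)"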

Now every git commit in the vault prints exo: index reconciled (Synced) on its way out. The rule didn’t change — files win — but it stopped being something agents must remember and became something a machine enforces. That’s the entire difference between configuration management and GitOps, replayed at the knowledge layer.

The part where it gets a little strange

The reason I’m writing this post at all: I didn’t have this idea. A scheduled agent did, on what I can only describe as an idle walk.

My vault has a weekly cron job — we call it the Wanderer — that samples pairs of notes that are far apart: different folders, different months, almost no shared vocabulary. A headless Claude gets the pairs with exactly one task: read both notes in full and say whether anything genuinely connects. “Nothing connects” is a successful run. That last sentence is load-bearing — the run always reports its result either way, so the agent never needs to manufacture a finding to have done its job.

On its very first walk, it collided a work note about Terraform-driven S3 provisioning with the architecture map of the vault itself, and wrote: same sentence in different clothes — and the brain copy is missing its reconciler. Then it listed the two fixes you just read about.

Retrieval answers the questions you ask. Distant collisions surface the questions you didn’t know you had. It turns out my second brain didn’t need to get better at remembering — it needed to occasionally interrupt me.

If you keep a vault

Whatever your stack — Obsidian, org-mode, a folder of markdown — if anything derives from your notes (an index, embeddings, a published site), then you have source of truth and derived state, and the GitOps question applies: who notices when they drift? If the answer is “I do, hopefully,” you’re running the sticky-note era. Give it a badge and a loop. It’s an afternoon.

📦 Five Ways to Manage Kubernetes Manifests (and Why They're Not All Equal)

Fri, 10 Oct 2025 00:00:00 +0000

The problem everyone hits

You’ve got a Kubernetes cluster. Now you need to describe what should run in it. You write some YAML, apply it, it works.

Then you need a second environment. Or a second service. Or someone else joins the project and asks “how do I add an app to this?” and you don’t have a good answer.

This is the manifest management problem, and there are five common solutions — ranging from “this works until it doesn’t” to “this is what production platforms actually look like.”

Approach 1: Raw manifests

The starting point for almost everyone. Write a YAML file, kubectl apply -f, done.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v1.2.3

Where it works: one service, one environment, learning Kubernetes. The feedback loop is immediate — write YAML, see what happens.

Where it breaks:

No templating. Want to change the image tag across ten services? Ten files, ten edits, ten chances to get it wrong.
Live state leaks in. If you export existing resources with kubectl get -o yaml, you get resourceVersion, generation, creationTimestamp, and managedFields in the output. Commit that to Git and you’ve created a permanent source of conflicts — ArgoCD compares what’s in Git against what’s in the cluster, sees stale version counters, and the diff never clears.
Copy-paste hell. A Deployment, a Service, an IngressRoute, a ServiceAccount, a NetworkPolicy — five files per app. Add a new app, copy five files, change the names, forget to update one. This is how environments drift apart silently.

The fix for the live-state problem is: only commit desired state. Strip every field that Kubernetes manages internally back to its clean spec. It’s tedious and easy to forget, which is exactly why people move on from raw manifests.
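As a rough sketch of what "only commit desired state" means in practice, the Python function below drops server-managed fields from an exported manifest that has already been parsed into a dict. The field list covers the four named above plus uid and the status block, which the API server also populates; the function name is hypothetical.

```python
import copy

# Fields the API server fills in; none of them belong in Git.
SERVER_MANAGED_METADATA = (
    "resourceVersion",
    "generation",
    "creationTimestamp",
    "managedFields",
    "uid",
)


def strip_live_state(manifest: dict) -> dict:
    """Return a copy of the manifest with server-managed fields removed."""
    clean = copy.deepcopy(manifest)
    metadata = clean.get("metadata", {})
    for field in SERVER_MANAGED_METADATA:
        metadata.pop(field, None)
    # status describes observed state, not desired state
    clean.pop("status", None)
    return clean
```

This is only the mechanical part. Defaulted spec fields and controller-added annotations still need a human eye before the file is committed.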

Approach 2: Kustomize

Kustomize is built into kubectl (kubectl apply -k) and natively supported by ArgoCD. The idea: you have a base/ with your raw manifests, and overlays that patch on top of them for different environments.

app/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml    # patches replicas to 1, image to :staging
    └── production/
        └── kustomization.yaml    # patches replicas to 3, image to :v1.2.3

# overlays/production/kustomization.yaml
resources:
  - ../../base
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment

Where it works: multi-environment setups where the difference between environments is mostly configuration values, not structure. Kustomize is good at this — you write the base once and patch only what differs.

Where it breaks:

No real parameterization. Kustomize patches are surgical edits, not templates. If your base structure needs to vary (different resource shapes per environment, conditional blocks), you’re fighting the tool.
Patching deep structures is ugly. JSON patches on nested YAML are verbose and hard to read. You end up writing more patch YAML than it would take to just copy the file.
Still repetitive across apps. Each app still gets its own base directory. You’re not abstracting the shared patterns across apps, only the differences between environments of the same app.

Kustomize is a significant step up from raw manifests for multi-environment setups. For complex templating or platform-level abstractions, it runs out of power quickly.
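To make "surgical edits, not templates" concrete, here is a small Python sketch of what a JSON-patch replace operation does to a base resource. It is an illustration of the idea, not Kustomize's implementation: it only handles the replace op on dict keys, and it ignores array indexes and JSON Pointer escaping.

```python
import copy


def apply_replace_ops(resource: dict, ops: list[dict]) -> dict:
    """Apply JSON-patch style 'replace' operations to a copy of the resource."""
    patched = copy.deepcopy(resource)
    for op in ops:
        if op["op"] != "replace":
            raise ValueError(f"unsupported op: {op['op']}")
        *parents, leaf = op["path"].strip("/").split("/")
        node = patched
        for key in parents:
            node = node[key]
        if leaf not in node:
            # 'replace' requires the target to exist already
            raise KeyError(f"path not found: {op['path']}")
        node[leaf] = op["value"]
    return patched


base = {"kind": "Deployment", "spec": {"replicas": 1}}
production = apply_replace_ops(
    base, [{"op": "replace", "path": "/spec/replicas", "value": 3}]
)
```

The base stays untouched and the overlay only states the difference, which is why this works well for values and poorly for structural variation.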

Approach 3: Helm

Helm adds real templating. Charts are parameterized bundles — templates with variables, conditionals, and loops — and values files supply the parameters.

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}
  namespace: {{ .Release.Namespace }}
spec:
  replicas: {{ .Values.replicas | default 1 }}
  template:
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          {{- if .Values.resources }}
          resources: {{ .Values.resources | toYaml | nindent 12 }}
          {{- end }}

# values-production.yaml
name: myapp
replicas: 3
image:
  repository: myorg/myapp
  tag: v1.2.3

Helm renders the templates at deploy time. What lands in the cluster is clean rendered YAML — no internal state, no conflicts.

Where it works: almost everywhere. Artifact Hub (the successor to the retired Helm Hub) already lists charts for most common software. For custom apps, writing a chart once and parameterizing per-environment is straightforwardly better than copying YAML.

Where it breaks:

Chart authoring is verbose. Writing a Helm chart from scratch involves a lot of Go templating boilerplate. For a simple app, it can feel like more scaffolding than application.
Debugging rendered output is annoying. helm template is your friend, but errors in templates produce unhelpful messages. The indentation rules (nindent, indent, toYaml) have sharp edges.
Values files still pile up. If every app has its own values file and there’s no shared structure between them, you’re back to copy-paste but now in YAML-that-configures-YAML.

Helm is the right tool for most Kubernetes deployments. The ecosystem support alone (upstream charts for Postgres, Redis, Vault, every CNCF project) makes it the pragmatic default.
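One way to keep values files from piling up is to lean on how Helm layers them: chart defaults first, then each values file on top, with maps merged key by key and everything else replaced. The Python sketch below imitates that layering under simplifying assumptions; it does not model Helm's handling of null (which deletes a key), and the function name is my own.

```python
def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into base; maps merge, other values replace."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


chart_defaults = {"replicas": 1, "image": {"repository": "myorg/myapp", "tag": "latest"}}
production = {"replicas": 3, "image": {"tag": "v1.2.3"}}
effective = merge_values(chart_defaults, production)
```

With this shape, a per-environment file only needs the keys that differ from the shared defaults.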

Approach 4: Jsonnet / CUE

For teams that need programmatic config generation — actual code, not templates — Jsonnet and CUE are the serious alternatives.

// deployment.jsonnet
local k = import "k.libsonnet";

local deployment(name, image, replicas=1) =
  k.apps.v1.deployment.new(name, replicas, [
    k.core.v1.container.new(name, image)
  ]);

{
  "deployment.yaml": deployment("myapp", "myorg/myapp:v1.2.3", replicas=3)
}

Where it works: large platforms where configuration is genuinely complex — many environments, many apps, deep interdependencies. Jsonnet lets you write real functions, share libraries, compose abstractions properly.

Where it breaks:

Steep learning curve. Jsonnet is a full language. CUE even more so — it has types, schemas, and a constraint system that takes time to internalise.
Small community. Excellent tooling, but you’re solving problems that have fewer Stack Overflow answers.
Overkill for most setups. If you’re not managing hundreds of services across multiple clusters, Helm is simpler and has everything you need.

Jsonnet is used seriously at Google-scale infrastructure teams and in some CNCF projects. For a homelab or a small-to-medium platform, it’s the right answer to a question you probably aren’t asking yet.

Approach 5: App-of-apps with generated Application CRDs

This is the ArgoCD-native meta-layer. Instead of managing manifests, you manage Application resources — and potentially use a chart or tool to generate those too.

A naive version: commit a folder of Application YAML files to Git, one per service. ArgoCD watches the folder and deploys each app.

A more sophisticated version: one “root app” that points to a chart, which generates all the other Application resources dynamically from a single config file.

Where it works: at the platform level, not the individual app level. App-of-apps is how you manage what ArgoCD manages, not how you write the service manifests themselves. Combined with Helm, it gives you centralized control over the entire cluster’s structure.

Where it breaks:

Hand-written Applications are painful. If you’re maintaining a folder of hand-written Application YAML files, one per service, you’ve traded manifest copy-paste for Application copy-paste. Each app needs its own Application resource with its repo URL, path, sync policy, and project reference.
Sync ordering matters. The root app must exist before children can sync. Get the wave ordering wrong and apps try to deploy before their namespaces exist.

How this homelab compares

My setup sits at the far end of approach 5, using Helm throughout.

There’s a single applications.yml file that describes every service in the cluster. A root Helm chart reads it and generates all the ArgoCD Application and AppProject CRDs automatically. Adding a service means adding an entry to that file — not touching five different places across five different files.

# applications.yml — this is the entire service catalog
- namespace: web-vaultwarden
  networkPolicies:
    profile: web-app
  applications:
    - applicationCode: web-vaultwarden
      path: helm-charts/extra-objects
      autoSync: true

That one entry generates: a Namespace, an ArgoCD AppProject, an ArgoCD Application, a set of Cilium NetworkPolicies (deny-all with ingress from Traefik and DNS/HTTPS egress), and a ServiceAccount. Nothing is written by hand.
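The generation step is done by a Helm chart in this setup, but the idea is easy to show in Python. The sketch below turns catalog entries into Argo CD Application manifests. The field mapping is an assumption made for illustration: the real chart's naming, the project-per-namespace convention, the target revision, and the repo_url parameter are all hypothetical here.

```python
def render_application(namespace: str, app: dict, repo_url: str) -> dict:
    """Build one Argo CD Application manifest from a catalog entry."""
    manifest = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": app["applicationCode"], "namespace": "argocd"},
        "spec": {
            "project": namespace,  # assumption: one AppProject per namespace
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",  # assumption
                "path": app["path"],
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }
    if app.get("autoSync"):
        manifest["spec"]["syncPolicy"] = {"automated": {"prune": True, "selfHeal": True}}
    return manifest


def render_catalog(catalog: list[dict], repo_url: str) -> list[dict]:
    """Render every application of every namespace entry in the catalog."""
    return [
        render_application(entry["namespace"], app, repo_url)
        for entry in catalog
        for app in entry["applications"]
    ]
```

The point is the direction of the dependency: the catalog is the input, and every Application is an output nobody edits by hand.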

The actual service manifests live in an extra-objects chart — a thin wrapper that renders raw YAML from values files. No templating in the service manifests themselves (they’re simple enough not to need it), but the infrastructure scaffolding around each app is entirely generated.

The result: every service gets the same operational properties. Same GitOps workflow, same secret management, same network isolation, same TLS termination. The platform work was done once. Adding a new app is writing manifests for the app’s specific behavior, not recreating the scaffolding.

The honest spectrum

Approach	Templating	Abstraction	Ecosystem	Complexity
Raw manifests	None	None	None	Low
Kustomize	Patches only	Overlays	Medium	Low-medium
Helm	Full	Per-chart	Large	Medium
Jsonnet/CUE	Full + typed	Libraries	Small	High
App-of-apps	Depends	Platform-level	ArgoCD-native	High

Most setups should start at Helm. Kustomize if you’re multi-environment and comfortable with patching. App-of-apps when you’re managing the platform layer, not individual services. Jsonnet/CUE when you know you’ve outgrown Helm — which is a specific and relatively rare problem to have.

Raw manifests are fine for learning. They’re the wrong answer for anything you intend to maintain.

More on how the homelab is structured: My Homelab Runs on GitOps.

🔄 Someone kubectl apply'd a Hotfix Directly. How Do You Detect and Prevent It?

Fri, 06 Jun 2025 00:00:00 +0000

The question

“How do you prevent configuration drift in a Kubernetes cluster?”

Configuration drift: the cluster’s actual state diverges from what’s declared in your source of truth. Someone runs kubectl edit deployment myapp to bump a memory limit during an incident. Someone adds a debug sidecar directly. Someone applies a YAML file from their laptop that was never committed to Git. The fix works. It goes undocumented. Six months later, a new deployment overwrites it. The incident recurs.

There are two distinct problems here that require different solutions:

Detection and remediation: how do you notice drift and revert it?
Prevention: how do you stop non-compliant resources from being created in the first place?

Detection and remediation: Argo CD selfHeal

If you’re using GitOps with Argo CD, detection and remediation are handled for you:

syncPolicy:
  automated:
    prune: true
    selfHeal: true

selfHeal: true means Argo CD continuously compares the cluster state to the Git repo and reverts any divergence. Someone runs kubectl edit deployment myapp and changes the replica count? Argo CD detects the diff and reverts it, at the latest on its next reconciliation cycle (default: every 3 minutes; changes to live resources are typically noticed sooner, because the controller also watches the cluster).

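prune: true means resources that Argo CD tracks for the application but that no longer exist in Git are deleted. Remove a manifest from the repo and the resource is gone on the next sync. A debug pod that someone kubectl apply’d by hand, without Argo CD’s tracking label or annotation, is not part of any application, so pruning alone will not remove it; that is one reason the admission policies covered below matter.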
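The two flags can be pictured as one pass of a reconcile loop. The Python sketch below is a toy model, not Argo CD's controller: resources are plain dicts keyed by name, and live is assumed to contain only the resources the application tracks.

```python
def reconcile(
    desired: dict[str, dict],
    live: dict[str, dict],
    self_heal: bool = True,
    prune: bool = True,
) -> tuple[dict[str, dict], list[tuple[str, str]]]:
    """One reconciliation pass: returns the new live state and the actions taken."""
    result = dict(live)
    actions: list[tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
            result[name] = spec
        elif live[name] != spec and self_heal:
            # drift: someone changed the live object, Git wins
            actions.append(("revert", name))
            result[name] = spec
    if prune:
        for name in live:
            if name not in desired:
                # tracked resource whose manifest was removed from Git
                actions.append(("prune", name))
                del result[name]
    return result, actions
```

With both flags off, the same pass would only report the differences, which is the OutOfSync state without automatic repair.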

This is the audit trail story too. Every legitimate change is a Git commit with an author, a timestamp, and a commit message. Everything that isn’t in Git doesn’t survive past the next reconciliation. If you want to know what changed and when, git log is the answer.

The gap selfHeal doesn’t close

selfHeal reverts drift after the fact. There’s a window — up to 3 minutes — where a drifted resource is serving traffic. For most changes, that’s fine. For a bad resource (wrong RBAC, missing network policy, container running as root), 3 minutes is enough to be a problem.

The other gap: selfHeal doesn’t tell you who made the change or generate an alert. It just silently fixes it. You need audit logging (kube-apiserver --audit-log-path together with --audit-policy-file, since no events are recorded without a policy) or an alerting rule on Argo CD’s health events to know that drift happened.

Prevention: Kyverno

Kyverno is a policy engine that runs as a Kubernetes admission webhook. Every resource creation or modification goes through it before being persisted. If the resource violates a policy, Kyverno can reject it outright (enforce mode) or allow it with a warning (audit mode).

The policies are Kubernetes resources themselves — they live in Git, they’re applied via GitOps, they’re versioned. No separate policy language to learn.

A policy that requires readiness probes on all Deployments:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-readiness-probe
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-readiness-probe
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Deployments must define a readiness probe."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - (name): "*"
                    readinessProbe:
                      periodSeconds: ">0"

With this policy active: kubectl apply -f deployment-without-probe.yaml is rejected at the API server. The error message is the one you defined in message. The deployment never reaches etcd.
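For intuition, the check such a policy performs at admission time boils down to something like the Python below. This is a sketch of the logic only, not how Kyverno evaluates patterns, and the function names are hypothetical.

```python
def containers_without_readiness_probe(deployment: dict) -> list[str]:
    """Return the names of containers that define no readinessProbe."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return [
        container.get("name", "<unnamed>")
        for container in pod_spec.get("containers", [])
        if not container.get("readinessProbe")
    ]


def admit(deployment: dict) -> tuple[bool, str]:
    """Admission decision: (allowed, message shown to the user on rejection)."""
    if containers_without_readiness_probe(deployment):
        return False, "Deployments must define a readiness probe."
    return True, ""
```

Every container has to pass. One sidecar without a probe is enough to reject the whole Deployment.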

A policy that blocks containers running as root:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-root-containers
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-runAsNonRoot
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet, DaemonSet]
      validate:
        message: "Containers must not run as root."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - (name): "*"
                    securityContext:
                      runAsNonRoot: true

A policy that enforces resource limits (common in multi-tenant clusters):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"

Kyverno can also mutate and generate

Policies aren’t only for validation. Kyverno can mutate incoming resources (add default labels, inject sidecars, set default resource requests) and generate new resources in response to events (create a NetworkPolicy whenever a new namespace is created).

Auto-add a standard label to every Deployment:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-labels
spec:
  rules:
    - name: add-team-label
      match:
        any:
          - resources:
              kinds: [Deployment]
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              managed-by: kyverno

Auto-create a default NetworkPolicy when a namespace is created:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-networkpolicy
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds: [Namespace]
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-all
        namespace: "{{request.object.metadata.name}}"
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress

The complete drift prevention picture

Developer runs: kubectl apply -f bad-deployment.yaml
  → API server receives request
  → Kyverno admission webhook intercepts
  → Policy check: no readiness probe → Rejected
  → API server rejects the request and returns Kyverno's message
  → Resource never reaches etcd

Developer runs: kubectl edit deployment myapp (valid change, just not via Git)
  → Edit succeeds (no policy violation)
  → Argo CD reconciliation fires (within 3 minutes)
  → Diff detected: cluster state ≠ Git state
  → selfHeal: revert to Git state
  → If audit logging enabled: event recorded with username and timestamp

Git is the audit trail for what should be there. kube-apiserver audit logs are the trail for what was attempted. Kyverno is the enforcer at admission time. Argo CD is the continuous reconciler. Four layers, each with a different job.

What interviewers are actually testing

The follow-up is usually: “What’s the difference between Kyverno and OPA Gatekeeper?”

Both are admission webhook policy engines. The practical differences:

Kyverno: policies are k8s-native YAML, no separate language to learn. Generate and mutate policies built in. Easier to get started with.
OPA Gatekeeper: policies are written in Rego, a purpose-built policy language that’s more expressive but has a steeper learning curve. Better if you’re already using OPA elsewhere (Terraform, microservice authorization).

For a Kubernetes-only environment, Kyverno is the pragmatic choice. For a platform team that uses OPA across the stack, Gatekeeper gives you policy consistency.

The deeper follow-up: “How do you test policies before enforcing them?” Use Audit mode first (validationFailureAction: Audit). Violations are logged as PolicyReport objects but requests aren’t rejected. Review the reports, fix the existing violations, then switch to Enforce. Never flip directly to Enforce in production — you’ll break things that were already running.
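The difference between the two modes is small enough to fit in one function. The Python sketch below models it under simplifying assumptions: violations are plain strings and the report is a list, whereas Kyverno writes PolicyReport objects. The Decision type and decide function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    allowed: bool
    reports: list[str] = field(default_factory=list)


def decide(violations: list[str], failure_action: str) -> Decision:
    """Audit records violations but admits the request; Enforce rejects it."""
    if failure_action not in ("Audit", "Enforce"):
        raise ValueError(f"unknown failure action: {failure_action}")
    if not violations:
        return Decision(allowed=True)
    return Decision(allowed=failure_action == "Audit", reports=list(violations))
```

Running in Audit first means the reports list fills up with everything that would have been rejected, before anything actually is.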

This is part of a series on Kubernetes interview questions. Previously: network isolation between services.

🔑 Deploy to Kubernetes Without Storing Any Cluster Credentials in CI

Fri, 09 May 2025 00:00:00 +0000

The question

“How would you design a CI/CD pipeline that deploys to Kubernetes without storing any cluster credentials anywhere?”

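The expected wrong answer: export your kubeconfig, base64-encode it, paste it into a CI secret named KUBE_CONFIG, and call it a day. This works. It also leaves a long-lived, cluster-wide credential sitting in your CI system, where any leaked variable or compromised job hands over the cluster.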

There are two correct answers in 2026, and which one you reach for depends on what you’re actually deploying.

Answer 1: GitOps (the one your interviewer probably wants)

In a GitOps setup, your CI pipeline never touches the cluster. It can’t leak credentials it doesn’t have.

The flow:

Developer pushes code
  → CI builds and tests
  → CI updates the image tag in the Git repo (a commit, not a kubectl command)
  → Argo CD detects the change
  → Argo CD applies it to the cluster

The cluster reaches out to Git. CI never reaches into the cluster. The only thing with cluster credentials is Argo CD itself — running inside the cluster, with no credentials to leak externally.

For self-hosted setups on Hetzner or Vultr, this is particularly clean because there’s no cloud IAM to configure. You point Argo CD at your GitLab repo, tell it which branch to watch, and you’re done.

# The Argo CD Application CRD — the only thing you need
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/myorg/myapp
    targetRevision: main
    path: helm-charts/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

selfHeal: true means if someone manually kubectl applys something, Argo CD reverts it. The Git repo is the only source of truth.

The CI image-tag update step looks like this:

# .gitlab-ci.yml
deploy:
  stage: deploy
  script:
    - |
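      # Update the image tag in values/myapp.yml and push
      sed -i "s/tag: .*/tag: ${CI_COMMIT_SHORT_SHA}/" values/myapp.yml
      git config user.email "ci@example.com"
      git config user.name "CI"
      git add values/myapp.yml
      # [skip ci] keeps this commit from triggering another pipeline
      git commit -m "chore: bump myapp to ${CI_COMMIT_SHORT_SHA} [skip ci]"
      # CI checks out a detached HEAD, so name the target branch explicitly
      # (assumes the origin remote authenticates with the deploy key)
      git push origin "HEAD:${CI_COMMIT_BRANCH}"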

CI needs write access to the Git repo — but that’s a deploy key, not a cluster credential. If it leaks, someone can push code. You’d rotate the deploy key and audit the commits. If a cluster credential leaks, someone owns your cluster.
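The sed one-liner replaces every line that contains "tag: ", which is fine until a values file grows a second tag. A slightly more defensive version of the same step, sketched in Python under the assumption that the file has exactly one tag line, could look like this (the function name is hypothetical):

```python
import re

# Matches a whole "tag: ..." line; [ \t] keeps the match on a single line
TAG_LINE = re.compile(r"^([ \t]*tag:[ \t]*).*$", flags=re.MULTILINE)


def bump_image_tag(values_text: str, new_tag: str) -> str:
    """Replace the single image tag in a values file, or fail loudly."""
    updated, count = TAG_LINE.subn(lambda match: match.group(1) + new_tag, values_text)
    if count != 1:
        raise ValueError(f"expected exactly one 'tag:' line, found {count}")
    return updated
```

Failing the job on zero or multiple matches is the design choice here: a deploy commit that silently changes nothing, or changes the wrong tag, is worse than a red pipeline.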

Answer 2: OIDC federation (for when you genuinely need push-based)

Some operations don’t fit the GitOps model. Infrastructure provisioning (terraform apply), one-off database migrations, or initial cluster bootstrapping — these need direct cluster access. The correct pattern here is OIDC federation.

The idea: your CI platform (GitLab, GitHub Actions) can already issue a signed JWT to every job. These JWTs contain claims such as which repo, which branch, and which pipeline triggered the job. You configure your Kubernetes API server to trust those JWTs, and the CI job authenticates directly using the token it already has.

No stored credentials. Every job gets a fresh token. The token expires when the job ends.

For a self-hosted GitLab, configure your k8s API server to trust GitLab as an OIDC issuer:

# /etc/rancher/k3s/config.yaml (or kube-apiserver flags)
kube-apiserver-arg:
  - "oidc-issuer-url=https://gitlab.example.com"
  - "oidc-client-id=your_client_id"
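  - "oidc-username-claim=sub"
  - "oidc-username-prefix=-"   # disable the default issuer-URL prefix so RBAC subjects match the raw sub claim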
  - "oidc-groups-claim=groups_direct"

Then create a ClusterRoleBinding that maps a specific GitLab identity to a Kubernetes role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gitlab-ci-deployer
subjects:
  - kind: User
    name: "project_path:myorg/myapp:ref_type:branch:ref:main"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: deploy-role
  apiGroup: rbac.authorization.k8s.io

The subject name is the sub claim from the GitLab JWT — it encodes the repo path and branch. Only jobs running on main in myorg/myapp get this binding. A job on a feature branch gets nothing.
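Because RBAC compares the subject name as an exact string, it helps to see how the sub claim is put together. The Python sketch below builds and parses the format used in the binding above. It only handles that one format, and the helper names are hypothetical.

```python
def subject_for(project_path: str, ref: str, ref_type: str = "branch") -> str:
    """Build the sub claim string that the RBAC subject has to match exactly."""
    return f"project_path:{project_path}:ref_type:{ref_type}:ref:{ref}"


def parse_sub_claim(sub: str) -> dict[str, str]:
    """Split a sub claim back into project path, ref type, and ref."""
    prefix = "project_path:"
    if not sub.startswith(prefix):
        raise ValueError("sub claim does not start with project_path:")
    project_path, sep, rest = sub[len(prefix):].partition(":ref_type:")
    if not sep:
        raise ValueError("sub claim has no ref_type part")
    ref_type, sep, ref = rest.partition(":ref:")
    if not sep:
        raise ValueError("sub claim has no ref part")
    return {"project_path": project_path, "ref_type": ref_type, "ref": ref}


bound_subject = subject_for("myorg/myapp", "main")
feature_job = subject_for("myorg/myapp", "feature/login")
```

A job on the feature branch presents a different string, so no binding matches and the request is unauthorized by default.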

In the CI job:

deploy:
  stage: deploy
  id_tokens:
    K8S_TOKEN:
      aud: your_client_id
  script:
    - |
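      # K8S_API_URL and K8S_CA_FILE are assumed CI variables (not secrets)
      kubectl config set-cluster mycluster \
        --server="${K8S_API_URL}" \
        --certificate-authority="${K8S_CA_FILE}"
      kubectl config set-credentials gitlab-ci \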
        --token="${K8S_TOKEN}"
      kubectl config set-context deploy \
        --cluster=mycluster \
        --user=gitlab-ci
      kubectl config use-context deploy
      kubectl rollout restart deployment/myapp -n myapp

The token in K8S_TOKEN is injected by GitLab. It expires with the job. The API server validates the token’s signature on every request, using the public keys GitLab publishes at its JWKS endpoint.

Which one to use

	GitOps	OIDC federation
CI needs cluster access	No	Yes (short-lived token)
Audit trail	Git history	kube-apiserver audit log
Revocability	Revert the commit	Token expires with the job
Self-hosted setup effort	Low	Moderate (OIDC config)
Works for infra provisioning	Not really	Yes

For application deployments: GitOps. The cluster reconciles continuously, drift is reverted automatically, and CI is completely decoupled from cluster state.

For infrastructure provisioning or one-off operations: OIDC federation. Short-lived credentials, branch-scoped permissions, nothing to rotate.

What you should never do: store a kubeconfig or a long-lived ServiceAccount token in CI secrets. Not because it’s hard to make work — it’s easy — but because the blast radius of a leak is unbounded, there’s no audit trail, and there’s no expiry. Everything that goes wrong with static secrets goes wrong eventually.

This is part of a series on Kubernetes interview questions. Next: how to handle secrets in a GitOps repository.

🏗️ My Homelab Runs on GitOps. Here's What That Actually Means.

Fri, 28 Mar 2025 00:00:00 +0000

Why this exists

I’ve been working in DevOps and platform engineering long enough to know what I don’t know. The patterns that separate robust infrastructure from “it works on my machine” infrastructure — GitOps, admission policies, network segmentation, secrets management — are easy to read about. They’re harder to actually internalise without running them yourself.

So I built a homelab. An old ThinkCentre I had sitting around, k3s, and a rule I set for myself before writing a single line of configuration: GitLab is the only source of truth. No manual kubectl after bootstrap. All changes go through git push.

That rule turned out to be more consequential than I expected.

The stack

The cluster runs about thirty services across two categories: infrastructure that makes the platform work, and applications that actually do things.

Infrastructure:

k3s — lightweight Kubernetes, single-node
Cilium — CNI with NetworkPolicy enforcement (Flannel, k3s’s default CNI, does not enforce NetworkPolicies itself; stock k3s relies on a bundled policy controller for that)
Argo CD — GitOps reconciler, watches the repo, applies changes
Traefik — ingress controller, two entrypoints
Cloudflare tunnel — external access without open ports
cert-manager — wildcard TLS cert via Let’s Encrypt DNS-01
oauth2-proxy — GitLab SSO protecting everything by default
Vault + External Secrets Operator — secrets management
Pi-hole — local DNS for *.hippotion.com

Applications: a media server (Jellyfin, *arr stack), Immich for photos, Vaultwarden for passwords, Home Assistant, n8n for automation, a Hugo blog, Obsidian via browser-based KasmVNC, and a few custom-built things I’ll get to below.

Traffic reaches the cluster in two ways

External traffic (from anywhere on the internet) goes through a Cloudflare tunnel. The cloudflared pod dials out to Cloudflare — no open ports on the server, no firewall rules, no exposed IP. Cloudflare terminates TLS and forwards plain HTTP to Traefik on port 7080. Cloudflare handles the certificate for external visitors.

Local traffic (home WiFi) goes through Pi-hole, which resolves *.hippotion.com to the server’s LAN IP. Traefik receives HTTPS on port 443, served with a wildcard certificate that cert-manager issues from Let’s Encrypt via DNS-01 challenge. Port 80 redirects to 443; the cloudflare entrypoint on 7080 does not redirect, because it’s already receiving plain HTTP from cloudflared.

The result: the same IngressRoute handles both paths.

spec:
  entryPoints:
    - cloudflare   # plain HTTP from the cloudflared pod
    - websecure    # local HTTPS with wildcard cert
  routes:
    - match: Host(`myapp.hippotion.com`)
      kind: Rule
      middlewares:
        - name: oauth-auth
          namespace: sys-oauth2-gitlab
      services:
        - name: myapp
          port: 8080

Every IngressRoute has both entrypoints. If you forget one, the service is unreachable from half your access paths. Learned that the first time I added an app and couldn’t reach it from the phone.

One file generates everything

The centrepiece of the setup is applications.yml — a single file that is the complete list of everything running in the cluster. Every entry generates a Namespace, an Argo CD AppProject, an Application, NetworkPolicies, and RBAC. Nothing is created anywhere else.

An entry looks like this:

- namespace: web-vaultwarden
  networkPolicies:
    profile: web-app
  applications:
    - applicationCode: web-vaultwarden
      path: helm-charts/extra-objects
      autoSync: true

Seven lines. That deploys a namespace, an Argo CD app that watches helm-charts/extra-objects/values-web-vaultwarden.yml, a full set of Cilium NetworkPolicies based on the web-app profile (deny-all with ingress from Traefik and egress to external), and a ServiceAccount. Adding a new service to the cluster takes an entry in this file plus a values file with the actual Kubernetes manifests.
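
For a sense of what comes out the other end, here is a sketch of the kind of Argo CD Application such an entry might render to. The repoURL is a placeholder, and the argocd namespace, the project name, and the mapping of autoSync: true to prune plus selfHeal are assumptions rather than the generator's confirmed output.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-vaultwarden
  namespace: argocd                      # assumed install namespace for Argo CD
spec:
  project: web-vaultwarden               # assumed: one AppProject per entry, same name
  source:
    repoURL: https://git.example.invalid/homelab.git   # placeholder
    targetRevision: HEAD
    path: helm-charts/extra-objects
    helm:
      valueFiles:
        - values-web-vaultwarden.yml     # the per-app values file named above
  destination:
    server: https://kubernetes.default.svc
    namespace: web-vaultwarden
  syncPolicy:
    automated:                           # assumed expansion of autoSync: true
      prune: true
      selfHeal: true
```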

The profile: web-app notation deserves a word. Raw NetworkPolicy YAML is repetitive and error-prone — every namespace needs a deny-all base plus specific allows. I template it. A Helm chart maps profile names to concrete policy sets. web-app means: deny all ingress except from the ingress namespace, deny all egress except DNS and external HTTPS. web-app-internal means the same but no external egress — suitable for services that only talk to other in-cluster services. media-server adds port 6881 for BitTorrent. The policies are generated; no one writes them by hand.
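
To make the profile concrete, this is roughly what web-app could expand to, written as a standard Kubernetes NetworkPolicy (which Cilium enforces). It is a sketch, not the chart's real output: the sys-traefik namespace name, DNS living in kube-system, and treating "external" as everything outside the private address ranges are all assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app
  namespace: web-vaultwarden
spec:
  podSelector: {}                  # every pod in the namespace
  policyTypes: [Ingress, Egress]   # anything not allowed below is denied
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sys-traefik   # assumed ingress namespace
  egress:
    - to:                          # DNS lookups
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumed home of cluster DNS
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:                          # external HTTPS only
        - ipBlock:
            cidr: 0.0.0.0/0
            except: [10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16]
      ports:
        - protocol: TCP
          port: 443
```

A web-app-internal variant would simply drop the second egress rule.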

Secrets without storing them in Git

Kubernetes Secret objects are not secrets. They’re base64-encoded blobs in etcd, and base64 is not encryption. Committing them to a Git repo — even a private one — is the wrong answer.

The setup here uses HashiCorp Vault as the actual secret store, with External Secrets Operator syncing Vault paths to Kubernetes Secrets. What lives in Git is an ExternalSecret CRD:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-credentials
  namespace: myapp
spec:
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: myapp-credentials
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: secret/myapp
        property: db-password

This is safe to commit. It says where the secret lives, not what it is. Vault contains the actual value. ESO syncs it to the cluster and refreshes every hour. Rotation means updating the value in Vault — no Git commit, no deployment.
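
On the consuming side, the workload only ever sees an ordinary Kubernetes Secret. A minimal sketch, assuming a Deployment named myapp and a placeholder image; only the Secret name and key come from the ExternalSecret above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                      # hypothetical workload
  namespace: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.invalid/myapp:1.0.0   # placeholder image
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: myapp-credentials   # the Secret ESO creates (target.name)
                  key: DB_PASSWORD          # matches secretKey in the ExternalSecret
```

Environment variables are read when the container starts, so a pod wired this way picks up a rotated value on its next restart.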

Vault runs in-cluster with a sidecar that auto-unseals on restart. Not production-grade (the unseal key is on the same PVC as Vault itself), but pragmatic for a homelab where availability matters more than a sophisticated key management ceremony.

Three things I built that were worth building

Local AI inference

The cluster runs a local LLM. The web-ai-engine namespace has Open WebUI fronting a llama-server serving Phi-3.5 Mini in GGUF format. The model file lives on the node’s filesystem, mounted as a hostPath volume.
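
The hostPath part is the only unusual bit, so here is a sketch of how the model file could be mounted. The image, the host directory, and the model file name are placeholders, and the args assume the image's entrypoint is the llama-server binary.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama-server
  namespace: web-ai-engine
spec:
  containers:
    - name: llama-server
      image: registry.example.invalid/llama-server:latest   # placeholder image
      args:
        - "-m"
        - "/models/phi-3.5-mini.gguf"    # hypothetical file name
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8080"
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: models
          mountPath: /models
          readOnly: true                 # the server only needs to read the weights
  volumes:
    - name: models
      hostPath:
        path: /var/lib/models            # hypothetical directory on the node
        type: Directory                  # fail fast if the directory is missing
```

A hostPath ties the pod to that node, which is a non-issue on a single-node cluster.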

web-openclaw is a personal AI assistant UI that can route requests to either external providers (via NVIDIA’s API) or the local llama-server, depending on the task. The local model handles things that don’t need to leave the house; the external API handles things that do. The network policy for web-openclaw explicitly allows egress to web-ai-engine and nowhere else for local inference.

Running a 3.8B-parameter model on homelab hardware is genuinely useful and costs nothing per query. It’s not GPT-4, but for summarisation, first drafts, and things you don’t want to send to a third-party API, it’s more than good enough.

Brew Buddy

I make kombucha. I was tracking fermentation batches in a notes app and getting annoyed at not being able to see history across batches. So I built a tracker.

Brew Buddy is a React frontend and a Go API backed by PostgreSQL, all running in the web-brew-buddy namespace. The images are built locally and imported into the cluster’s container runtime with k3s ctr images import. It’s deployed like any other app — a values file, an entry in applications.yml, a Vault secret for the database password.

The point isn’t the app. The point is that the platform handles a custom hobby project with the same operational properties as Vaultwarden or Immich. Same GitOps workflow, same secret management, same network isolation, same TLS termination. Adding an app to this cluster takes an afternoon of writing manifests and a few seconds of git push. The platform work was done once.

QR device login

This one has its own post because it took three days and four complete rewrites of oauth2-proxy’s session format to get right.

The short version: the Homer dashboard on the living room TV needed a way to log in without typing credentials on a TV keyboard. I built a device-flow OAuth service: the phone scans a QR code, the phone authenticates with GitLab, and a TV session is created. Ending the session from the phone kills the TV’s session immediately by deleting its oauth2-proxy Redis ticket.

It’s the most overengineered solution to a problem I have, and I don’t regret a minute of it.

What operating this way actually changes

The practical difference of the no-manual-kubectl rule is larger than it sounds.

The audit trail is automatic. Every change to the cluster is a git commit with an author, a timestamp, and a diff. There’s no “what did I change last Tuesday?” — I know exactly what changed last Tuesday, and I can revert it with git revert. The Argo CD UI shows the diff between what’s in Git and what’s running. If there’s a diff, something went wrong.

New services are cheap to add. The platform does the repetitive work — namespace, RBAC, network policies, TLS termination, OAuth protection. Adding a new app is writing the manifests and updating applications.yml. The infrastructure concerns are handled.

Recovery is straightforward. If I rebuild the node (which I’ve done), I run two bootstrap scripts, apply one Argo CD manifest, and the cluster reconciles itself from Git over the next few minutes. The only things that require manual work are the secrets that can’t live in Git — two OAuth credentials and the Cloudflare tunnel token, all recreated by scripts/create-secrets.sh.

Experimentation is safe. Apps I’m not sure I’ll keep are marked toggleable: true. Turning one off is removing its entry from applications.yml and pushing. Turning it back on is adding the entry back.

What it doesn’t solve

Bootstrap is manual. The first kubectl apply -f argocd/root-app.yaml happens outside of GitOps by definition. The three bootstrap secrets can’t be in Git. This is unavoidable — you need to trust something before GitOps can take over, and that something is a short manual procedure.

Some things fight the model. k3s’s built-in addon controller rewrites the metrics-server Deployment on every k3s restart, removing a patch needed for Cilium compatibility. The fix is a pod that watches for the revert and reapplies the patch. It works, but it’s a workaround for a component I don’t control.

Single-node means single point of failure. For a homelab, that’s acceptable. For anything important, it’s not.

The honest summary

I set out to learn production-grade Kubernetes patterns, and I did. The GitOps constraint turned out to be the best engineering decision in the project — not because it made things easier in the short term (it didn’t), but because it forced every change through a path that is auditable, reversible, and consistent.

The cluster is a single ThinkCentre running about thirty services, secured by Cilium network policies, authenticated via GitLab SSO, with secrets managed by Vault and all configuration in a Git repo that I could hand to someone tomorrow and they’d understand what’s running and why.

That’s the goal. For a homelab, I’ll call it achieved.
