Cilium on hippotion

🌱 My Second Brain Weeds Itself Now

Fri, 27 Feb 2026 00:00:00 +0000

A few weeks ago I rebuilt my second brain as a folder of markdown in git — vault is the source of truth, everything else (search index, graph, 3D viewer) is a derived layer I can delete and rebuild. I love it. But a knowledge base has a dirty secret: it rots.

Not the files — those are fine. The connections rot. You capture a note at 11pm and never link it to anything, so it becomes an orphan floating off the graph. A project note’s one-line summary describes what the project was three weeks ago. Two notes are obviously about the same thing and neither knows the other exists. Do this for a few months and you don’t have a second brain, you have a junk drawer with good search.

The honest fix is to weed the garden regularly. The honest truth is that nobody does, including me.

So I stopped relying on myself and built a gardener.

What it actually does

Every night at 3am, on my homelab box, a script runs:

Detect — exo garden, a plain query over the index, produces a report: here are the orphans, here are notes that should probably link to each other, here are summaries that look stale. No AI in this step. It’s SQL and graph traversal. Deterministic, boring, trustworthy.
Decide and write — that report gets piped to claude -p (Claude Code in headless mode). Claude reads the vault’s operating contract, makes only high-confidence edits — add a [[wikilink]] between two genuinely related notes, refresh a stale summary — caps itself at ~10 notes a night, and writes a dated log note explaining exactly what it changed and what it deliberately skipped.
Commit — the wrapper reindexes and lands everything as a single garden: 2026-06-09 … git commit, then pushes. My 3D graph viewer picks it up on the next sync.

The first real run, it found one orphan (90-meta/README), linked it into the notes it actually indexes, and then — this is the part I liked — declined to touch the 12 “stale summary” candidates because, on inspection, every one of them was already accurate. It wrote: “flagged by length, not staleness; churning them would add noise.” A gardener that knows when not to prune is the one you can leave alone.

“Isn’t this a solved problem?”

Mostly, no — but partly, yes, and I want to be straight about it. AI-assisted note-linking exists: Obsidian plugins like Smart Connections suggest related notes, and apps like Mem and Reflect auto-organize as you write. They’re good.

Three things make this different enough to build:

Every change is a reviewable git diff, authored by a named agent. Not silent magic that rearranges your notes while you’re not looking. git log -p shows you exactly what the gardener did last night; git revert undoes a bad night in one command. For something as personal as a knowledge base, “show me the diff” beats “trust me.”
It’s mine, end to end. Runs on my hardware, on my schedule, with a model I point at. No SaaS holds my brain hostage.
The detection is deterministic; the model only acts. The LLM never decides what’s wrong — a boring query does that. The model only decides how to fix the things already found. That split keeps the whole thing auditable and cheap.

If you already live in a tool that does this and you trust it, great. I wanted the git-diff trail and the local control.

The part I actually want to tell you about

The plan was tidy: I run n8n on the same cluster, so n8n would be the scheduler — fire nightly, SSH into the node, run the gardener. Clean, visual, one workflow.

n8n could not reach the node. At all. Every port: ECONNREFUSED.

This sent me down a genuinely interesting hole, because the homelab runs Cilium for networking, and Cilium has opinions about your own node that plain Kubernetes does not.

First instinct: a NetworkPolicy allowing egress to the node’s IP. Wrote it, synced it, still refused. The reason is a Cilium subtlety worth knowing: the node isn’t a CIDR, it’s an identity. Cilium classifies your cluster’s own node as the special host identity, and ordinary ipBlock CIDR rules do not match it unless you flip a cluster-wide setting (policy-cidr-match-mode: nodes). My 192.168.0.109/32 rule was a no-op.

So I switched to the Cilium-native tool: a CiliumNetworkPolicy with toEntities: [host]. Confirmed it applied — I could see reserved:host allowed right there in the datapath’s BPF policy map. I confirmed the node’s IP really does resolve to identity 1 (host). I confirmed the host firewall was disabled. Everything said “allowed.”
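For reference, the two attempts look roughly like this. It is a sketch, not the exact manifests from the cluster: the n8n namespace, the app: n8n label, the policy names, and port 22 (SSH) are assumptions. The node IP and the host entity come from the story above.

```yaml
# Attempt 1: standard NetworkPolicy with an ipBlock for the node IP.
# Under Cilium's default settings a CIDR rule does not select the
# cluster's own nodes, so this rule never matches.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-node-ip      # hypothetical name
  namespace: n8n                     # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: n8n                       # assumed label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 192.168.0.109/32
      ports:
        - port: 22                   # assumed: SSH
          protocol: TCP
---
# Attempt 2: Cilium-native policy that selects the node by entity.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-egress-to-host         # hypothetical name
  namespace: n8n                     # assumed namespace
spec:
  endpointSelector:
    matchLabels:
      app: n8n                       # assumed label
  egress:
    - toEntities:
        - host                       # the node this pod runs on
      toPorts:
        - ports:
            - port: "22"             # assumed: SSH
              protocol: TCP
```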

Still ECONNREFUSED.

That’s the wall. The packet leaves the pod with Cilium’s blessing, hits the host’s own network stack, and something there sends a reset — and I couldn’t see what, because inspecting the host firewall needs root, and this automation deliberately doesn’t have it. I could have kept digging with a password. But I stopped and asked a better question: why am I making a pod reach back into the host it’s running on at all?

That’s an awkward direction. The work has to happen on the host (that’s where the vault, git creds, and Claude live). A pod straining to SSH into its own node is fighting the grain of the platform.

So I inverted it. The node schedules itself — a plain cron entry, rock-solid, no network gymnastics. And n8n, instead of triggering the job, receives it: at the end of each run the node POSTs a summary to an n8n webhook. Node→n8n works perfectly (it’s just an outbound HTTPS call to a URL). n8n keeps the run history and is the place I’ll later wire a phone notification.

I lost nothing that mattered. n8n is still my dashboard; the schedule just lives where the work lives. And I deleted the SSH key and the network-policy hole I’d opened — the cleanup felt better than the original plan would have.

The lesson, such as it is

Two, actually.

One: when you’re automating something to run unattended, the bug you want to find is the one that shows up in a dry run at 2pm, not at 3am three weeks from now. I almost shipped a version where a brand-new note (untracked by git) was invisible to my change-detection and would’ve been silently wiped each night. The dry run caught it. Always build the dry run.

Two, the bigger one: I spent an hour trying to make a pod punch into its host because that was my plan, and the platform kept saying no in increasingly specific ways. The fix wasn’t a cleverer NetworkPolicy. It was noticing I was pushing against the design and turning around. The node scheduling itself and reporting up to n8n is simpler, safer, and more honest about where the work actually lives.

My brain weeds itself now. Every morning there’s maybe one small, sensible commit waiting — a link I’d have never made, a summary nudged back to true — and I can read exactly what changed before my coffee’s done. That’s the whole dream of a second brain that isn’t a junk drawer: it stays a garden, and I barely have to touch it.

🧱 How Do You Isolate Two n8n Tenants on Kubernetes — and Prove Each Wall Holds?

Fri, 19 Dec 2025 00:00:00 +0000

The question

“You’re running n8n for multiple customers on the same Kubernetes cluster. What stops Customer A from reading Customer B’s API keys, calling Customer B’s services, or starving Customer B’s workflows by burning the whole node?”

Three different walls, three different mechanisms. Most articles I’ve read on K8s multi-tenancy list the primitives — namespaces, NetworkPolicies, ResourceQuotas, RBAC — without showing what each one actually catches when you try to cross it. This post does the second part. The receipts are the point.

The setup: two namespaces, web-tenant-acme and web-tenant-globex, each running their own n8n instance on the same node. The only thing keeping them apart is the walls we build around each namespace.

The mental model: subtractive isolation

Kubernetes is a flat network with shared everything by default. You don’t add isolation by writing allow rules. You subtract trust by adding default-deny rules, and then carefully allow back only the connections each tenant actually needs.

A tenant doesn’t have access to another tenant because there is no rule allowing it. The absence of an allow rule is the wall.

Three of these absences make up the picture:

| Wall | Primitive | Failure mode when crossed |
|---|---|---|
| Network | Cilium NetworkPolicy, default-deny egress | Connection times out (silent drop) |
| Secret | Vault Kubernetes-auth, per-tenant policy | `403 permission denied` from Vault itself |
| Resource | ResourceQuota + LimitRange | Pod rejected at admission time |

Different layers, different error messages. That’s how you can tell what stopped you.

Wall 1 — Network: Cilium NetworkPolicy

n8n in web-tenant-acme can reach whoami.web-tenant-acme.svc.cluster.local (its own service in its own namespace) but not whoami.web-tenant-globex.svc.cluster.local. The same DNS shape, the same cluster, the same node. One succeeds, the other hangs.

The primitive is a default-deny egress policy applied to every pod in the namespace, with two narrow exceptions: intra-namespace traffic (so n8n can still reach its own service) and DNS to kube-system (otherwise nothing resolves anything).

# Effective policy on every pod in web-tenant-acme:
spec:
  podSelector: {}
  policyTypes: [Egress, Ingress]
  egress:
    - to:                                     # intra-namespace traffic OK
        - podSelector: {}
    - to:                                     # DNS to kube-dns OK
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports: [{port: 53, protocol: UDP}]

There is no rule for web-tenant-globex. Cilium’s eBPF datapath drops the SYN packet on the way out.

The receipt — an n8n HTTP node configured to GET http://whoami.web-tenant-globex.svc.cluster.local/. It hangs for the full timeout, then errors with AxiosError: timeout of 5000ms exceeded / code: ECONNABORTED.

The interesting bit: DNS still works. kube-dns is allowed, so the cross-namespace Service still resolves. The TCP handshake is what gets dropped. That’s a useful signal in real incident response: “DNS resolves but the connection hangs” points strongly at a NetworkPolicy (or another firewall silently dropping packets) rather than at the service itself.

Wall 2 — Secret: Vault Kubernetes-auth + ESO

Now imagine Acme’s n8n misbehaves: somebody pushes a workflow that tries to read Globex’s API keys via an ExternalSecret. The network isn’t the issue — both tenants need to reach Vault, so they both have an egress rule for sys-vault. The wall has to be at the identity layer.
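A minimal sketch of that shared egress rule, assuming Vault’s namespace carries the standard kubernetes.io/metadata.name label and that Vault listens on 8200 (the port in the Vault URL quoted further down). The policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-vault           # hypothetical name
  namespace: web-tenant-acme         # same shape in web-tenant-globex
spec:
  podSelector: {}                    # every pod in the tenant namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sys-vault
      ports:
        - port: 8200
          protocol: TCP
```

Both tenants get the same rule, which is exactly why the network layer cannot tell them apart here.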

Each tenant gets three things:

A dedicated ServiceAccount (n8n-acme, n8n-globex).
A Vault Kubernetes-auth role bound to that SA in that namespace, mapped to a Vault policy that grants read on only its own KV path.
A namespaced External Secrets SecretStore that authenticates as the SA via the Kubernetes TokenRequest API.

# Vault policy: tenant-acme can read its own secrets, nothing else.
path "secret/data/web-tenant-acme"     { capabilities = ["read"] }
path "secret/metadata/web-tenant-acme" { capabilities = ["read"] }

vault write auth/kubernetes/role/tenant-acme \
  bound_service_account_names=n8n-acme \
  bound_service_account_namespaces=web-tenant-acme \
  policies=tenant-acme \
  ttl=1h
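The third piece, the namespaced SecretStore, might look like the following. This is a sketch under assumptions rather than the demo’s actual manifest: the store name vault-acme, the external-secrets.io/v1beta1 API version, the KV v2 mount at secret, and the auth mount path kubernetes are guesses consistent with the policy and role shown above.

```yaml
apiVersion: external-secrets.io/v1beta1   # assumed; check which version your ESO release serves
kind: SecretStore
metadata:
  name: vault-acme                        # hypothetical name
  namespace: web-tenant-acme
spec:
  provider:
    vault:
      server: "http://sys-vault.sys-vault.svc.cluster.local:8200"
      path: "secret"                      # KV mount, matching secret/data/... in the policy
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"         # matches auth/kubernetes/role/...
          role: "tenant-acme"
          serviceAccountRef:
            name: "n8n-acme"              # ESO requests a token for this ServiceAccount
```

Because the store is namespaced and tied to one ServiceAccount, anything synced through it carries that tenant’s identity to Vault.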

When Acme’s n8n tries an ExternalSecret pointing at secret/web-tenant-globex/..., ESO authenticates fine (the SA is valid), Vault recognises the caller, looks up the tenant-acme policy, and answers with the most satisfying line in this whole demo:

URL: GET http://sys-vault.sys-vault.svc.cluster.local:8200/v1/secret/data/web-tenant-globex
Code: 403. Errors:
* permission denied

This is the bit that separates “namespace isolation” from real multi-tenant secret isolation. Plain Kubernetes Secrets + RBAC stop a tenant from listing another tenant’s Secret objects, but the moment you go upstream — to Vault, to a cloud KMS, to an SSM Parameter Store — the secret store needs to enforce identity itself. The network said yes; the secret store still says no.
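For reference, an ExternalSecret that would provoke that 403 could look like this. It is a sketch: the resource names and the store name vault-acme are hypothetical, and it assumes a SecretStore in web-tenant-acme that authenticates to Vault as n8n-acme against a KV v2 mount at secret.

```yaml
apiVersion: external-secrets.io/v1beta1   # assumed API version
kind: ExternalSecret
metadata:
  name: borrow-globex-keys                # hypothetical name
  namespace: web-tenant-acme
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: vault-acme                      # hypothetical store, authenticates as n8n-acme
  target:
    name: globex-keys                     # Secret that would be created on success
  dataFrom:
    - extract:
        key: web-tenant-globex            # another tenant's KV path
```

Since the tenant-acme policy only covers secret/data/web-tenant-acme, Vault refuses the read and the target Secret is never created.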

Wall 3 — Resource: ResourceQuota + LimitRange

The third concern is the noisy neighbour: Acme’s runaway workflow allocating a 4Gi pod and OOM-killing everything else on the node. The network policy doesn’t catch this (no network call), and Vault doesn’t catch this (no secret request). The kernel will, eventually — but you don’t want eventually. You want admission-time rejection.

Two primitives:

apiVersion: v1
kind: ResourceQuota
metadata: { name: tenant-quota, namespace: web-tenant-acme }
spec:
  hard:
    requests.cpu:    "1"
    requests.memory: 1Gi
    limits.cpu:      "2"
    limits.memory:   2Gi
    pods:            "10"
---
apiVersion: v1
kind: LimitRange
metadata: { name: tenant-limits, namespace: web-tenant-acme }
spec:
  limits:
    - type: Container
      default:        { cpu: 500m, memory: 512Mi }
      defaultRequest: { cpu: 50m,  memory: 128Mi }
      max:            { cpu: "2",  memory: 1Gi }

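ResourceQuota caps the namespace total. LimitRange bounds any individual container and supplies defaults for pods that don’t declare requests/limits. The defaults matter because once a quota covers requests and limits, a pod that omits them is rejected at admission unless the LimitRange fills them in. The per-container max (1Gi here, against a 2Gi namespace limit) stops a single container from claiming the whole quota in one allocation.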

The receipt — a server-side dry-run of a single 4Gi pod, which never gets created:

$ kubectl apply -n web-tenant-acme --dry-run=server -f noisy-neighbor.yaml
Error from server (Forbidden): error when creating "STDIN":
pods "noisy-neighbor" is forbidden:
  maximum memory usage per Container is 1Gi, but limit is 4Gi

Not a kernel OOMKill. Not a pod stuck in Pending. A flat refusal from the API server before the scheduler even sees the request.
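A manifest along these lines would trigger that rejection. The real noisy-neighbor.yaml isn’t shown in this post, so the image, command, and container name here are assumptions; only the pod name and the 4Gi figure come from the output above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: noisy-neighbor
spec:
  restartPolicy: Never
  containers:
    - name: hog                    # hypothetical container name
      image: busybox:1.36          # assumed image
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 4Gi              # above the LimitRange max of 1Gi per container
```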

What this does not prove

A homelab demo on one node with two synthetic tenants is not n8n Cloud. The honest gaps:

Execution sandboxing. A workflow can still run arbitrary code via the Code node or shell-outs. These walls stop infrastructure leakage; they don’t sandbox what n8n itself executes. Real n8n Cloud needs more than namespace walls for that — gVisor / Firecracker / per-tenant worker pools are the usual answers, and n8n’s queue mode lends itself to the last.
Pooled worker queues. Queue mode runs main/webhook/worker as separate deployments backed by Redis + Postgres. Two tenants sharing a worker pool need additional checks at the job-routing layer to keep workflows from accessing the wrong tenant’s binary data. Out of scope for the homelab demo.
Control plane. Both tenants reach the same API server. A cluster-admin-equivalent compromise breaks everything. This is the assumption every shared K8s setup makes.
Node-level. Same kernel. Container escape, CPU side channels, the usual list — all apply. For paranoid tenants the answer is dedicated nodes via taints/tolerations or separate clusters entirely.

The demo proves the namespace-shaped walls hold. It does not prove the whole stack is safe against a determined attacker already running code inside a tenant. That’s a different post.

Part of a Kubernetes-on-the-homelab series — previously: preventing a compromised pod from calling your database, GitOps secrets.

🛡️ How Do You Prevent a Compromised Pod From Calling Your Database?

Fri, 23 May 2025 00:00:00 +0000

The question

“How do you enforce network isolation between services in a Kubernetes cluster?”

The default Kubernetes network model is flat. Every pod can reach every other pod, in any namespace, on any port. There are no firewalls, no ACLs, no segmentation. A compromised frontend pod can connect directly to your PostgreSQL port, your Redis port, your internal admin API, and every other service in the cluster.

This is intentional — Kubernetes doesn’t assume you want isolation, because not everyone does. But if you do want it, you need to add it.

NetworkPolicy: the primitive

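A NetworkPolicy is a Kubernetes resource that selects a set of pods and defines what traffic is allowed to reach them (ingress) and what traffic they’re allowed to send (egress). Once a policy selects a pod for a given direction, traffic in that direction that isn’t explicitly allowed is dropped; pods that no policy selects stay fully open.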

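The catch: NetworkPolicy resources have no effect unless something in the cluster enforces them. Flannel on its own does not. Calico, Cilium, and Canal do. (k3s is a special case: its default CNI is Flannel, but it bundles a kube-router-based network policy controller, so policies are enforced unless that controller is disabled.) If you’re running plain Flannel and you apply a NetworkPolicy, it will be silently ignored: no error, no warning.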

The default-deny pattern

The correct starting point is a default-deny policy that blocks everything, applied to the namespace. You then add explicit allow policies for the traffic you actually need.

# Block all ingress and egress in this namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: myapp
spec:
  podSelector: {}        # matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress

With this in place, your pods can’t receive traffic and can’t send traffic. You then add back what you need.

Allowing specific traffic

Allow the web frontend to receive traffic from the ingress controller:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-traefik
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sys-traefik

Allow the backend to talk to PostgreSQL:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-postgres
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - port: 5432
          protocol: TCP

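After these two policies: the frontend receives traffic from Traefik, and the backend is allowed to send traffic to Postgres on 5432. Because default-deny-all also blocks ingress, that connection only completes once the Postgres pods have a matching ingress allow from the backend. The frontend cannot reach Postgres. The backend cannot receive traffic from the ingress controller. Neither can call anything else.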
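A minimal sketch of the ingress side of the backend-to-Postgres path, needed because default-deny-all covers ingress as well as egress. It assumes the app: postgres and app: backend labels used above; the name matches the policy list later in this post.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-postgres
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - port: 5432
          protocol: TCP
```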

The DNS gotcha

Once you add a default-deny egress policy, DNS stops working. Your pods can no longer resolve service names because they can’t reach kube-dns in the kube-system namespace.

You need to explicitly allow it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-dns
  namespace: myapp
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

Missing this is the most common reason “everything broke after I added NetworkPolicies”. Add it to every namespace that has a default-deny policy.

Cilium: the same model with more power

Cilium implements the standard NetworkPolicy API and adds its own CiliumNetworkPolicy CRD with L7 capabilities.

Standard NetworkPolicy works at L3/L4 — IP addresses and ports. Cilium’s CRD adds:

L7 HTTP filtering: allow specific HTTP methods and paths, not just port 8080.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-reads
  namespace: myapp
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/.*"

DNS-based egress: allow egress to github.com by hostname rather than IP address. This matters for external services with dynamic IPs.

egress:
  - toFQDNs:
      - matchName: "github.com"
    toPorts:
      - ports:
          - port: "443"
            protocol: TCP
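In a complete policy, a toFQDNs rule is normally paired with a DNS rule, because Cilium learns which IPs belong to a name by observing the pod’s DNS lookups through its DNS proxy. A sketch of a full policy, assuming the backend is the pod that needs GitHub and that cluster DNS runs as kube-dns in kube-system (the policy name is hypothetical):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-egress-github          # hypothetical name
  namespace: myapp
spec:
  endpointSelector:
    matchLabels:
      app: backend                   # assumed: the pod that talks to GitHub
  egress:
    # Allow DNS and let Cilium's DNS proxy inspect the lookups.
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Allow HTTPS only to the IPs that github.com resolved to.
    - toFQDNs:
        - matchName: "github.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```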

Identity-based policies: Cilium assigns each pod a numeric security identity derived from its labels. Policies are enforced by identity, not IP address, so pod restarts (which change IPs) don’t break policy enforcement.

What a real namespace policy set looks like

For a typical web app with frontend, backend, and database:

Namespace: myapp
├── default-deny-all (ingress + egress, all pods)
├── allow-egress-dns (egress, all pods, port 53)
├── allow-ingress-frontend (ingress frontend, from sys-traefik namespace)
├── allow-egress-frontend-to-backend (egress frontend, to backend:8080)
├── allow-ingress-backend (ingress backend, from frontend)
├── allow-egress-backend-to-postgres (egress backend, to postgres:5432)
└── allow-ingress-postgres (ingress postgres, from backend)

Seven policies. The database has exactly one inbound path: from the backend. The frontend has no path to the database at all. A compromised frontend pod cannot scan the internal network, because egress to arbitrary destinations is blocked.
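The two frontend-to-backend entries in that list aren’t spelled out elsewhere in this post. A minimal sketch, assuming the app: frontend and app: backend labels and port 8080 from the tree above:

```yaml
# Egress side: the frontend may open connections to the backend on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-frontend-to-backend
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - port: 8080
          protocol: TCP
---
# Ingress side: the backend accepts those connections, and only those.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-backend
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
          protocol: TCP
```

Each hop needs both halves: with default-deny in both directions, an egress allow without the matching ingress allow still ends in a dropped connection.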

What interviewers are actually testing

The follow-up is usually: “How do you manage this at scale? Writing NetworkPolicies for every namespace by hand doesn’t scale.”

The answer: you don’t write them by hand. You template them. In a GitOps setup, your namespace configuration declares what network access the service needs in a structured form, and a Helm chart or operator generates the actual NetworkPolicy resources from those declarations.

For example, an applications.yml entry might look like:

networkPolicies:
  denyAll: true
  allowIngressFromIngress: true
  allowEgressToNamespaces: ["sys-postgres"]

And a Helm chart translates that into four concrete NetworkPolicy objects. The developer declares intent; the platform enforces it. No one writes raw YAML for each namespace.

The second follow-up: “What about east-west traffic between services in the same namespace?” Add allowIntraNamespace: true as a flag that generates a policy allowing all pod-to-pod traffic within the namespace, while still blocking cross-namespace traffic.
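A sketch of what that flag could generate. The policy name is hypothetical and the chart itself isn’t shown here. An empty podSelector inside a from/to clause matches every pod in the policy’s own namespace and nothing outside it.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace        # hypothetical name
  namespace: myapp
spec:
  podSelector: {}                    # applies to all pods in myapp
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}            # any pod in the same namespace
  egress:
    - to:
        - podSelector: {}            # any pod in the same namespace
```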

This is part of a series on Kubernetes interview questions. Previously: zero-downtime deployments. Next: preventing configuration drift.