Ai on hippotion

Two Birds That Read the Web for Me: One Hoards, One Scatters

Fri, 12 Jun 2026 00:00:00 +0000

I have a vault of markdown notes that I treat as a second brain, and I run GitOps over it like it’s production infrastructure. It already has agents that work on it from the inside: a nightly gardener that weeds orphans and suggests links, and a Wanderer that collides random pairs of my own notes looking for connections I missed.

The obvious next move is to point an agent at the outside — let it read the web and tell me what matters. That move is also a small landmine, and most “AI reads the internet for you” tooling steps right on it. So this week I built two of them instead of one, named them after corvids, and the reason there are two is the entire point of this post.

Meet the Magpie and the Blue Jay.

The same fear, twice

Before either bird got a name, both inherited a single non-negotiable rule, and it’s worth saying plainly because it’s the part everyone skips:

An agent that reads the internet and writes to your notes is a prompt-injection pipeline aimed straight at your trust root.

My vault isn’t just storage. Every other agent (the gardener, the Wanderer, the search that answers “what am I building?”) reads it as trusted context. So the moment one agent ingests a GitHub README or a news headline (attacker-influenceable text) and is allowed to write a note, a stranger on the internet gets to whisper instructions into the thing my whole system believes. “Structured API” narrows that surface. It does not close it.

Both birds are built on the same chassis as the gardener, and that chassis enforces the fear rather than trusting the model to behave:

Two phases, hard split. A wrapper-owned FETCH step pulls the external text in plain Bash — Claude is not in the loop, can’t be talked into anything, because it isn’t running yet. Then a COLLIDE step starts claude -p with the fetched text handed in as inline data, and that process gets only Read / Glob / Grep / Write. No Bash, no git, no network, no MCP. While untrusted text is in the context window, the agent has no tool that can reach the outside world or rewrite history.
Allowlist, not the open web. Each bird reads a short, named list of sources. Nothing else.
Quarantine, not the vault. Findings land in quarantine//, which lives outside vault/. The indexer never sees it. Nothing it writes is ever auto-wikilinked into the graph. Promotion to a real note is a thing I do, by hand, after reading it.
Blast radius is checked, not assumed. A run may modify only its quarantine directory. Anything written anywhere else is discarded and reported as a violation.
“Nothing found” is a successful run. Neither bird has a quota. This is the honesty contract I stole from the Wanderer — an agent under pressure to produce N findings will manufacture N findings, and manufactured insight is worse than silence.
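The blast-radius rule reduces to a pure path check. Here is a minimal sketch in Python, not the wrapper's actual Bash: `find_violations`, the `quarantine/magpie` directory name, and the idea that the wrapper already holds a list of changed repo-relative paths are all assumptions for illustration.

```python
from pathlib import PurePosixPath


def find_violations(changed_paths, quarantine_dir):
    """Return the changed paths that fall outside the run's quarantine directory.

    Paths are repo-relative POSIX strings. Anything absolute or containing
    '..' is treated as a violation instead of being resolved.
    """
    allowed = PurePosixPath(quarantine_dir)
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        escapes = path.is_absolute() or ".." in path.parts
        inside = allowed in path.parents
        if escapes or not inside:
            violations.append(raw)
    return violations


if __name__ == "__main__":
    changed = [
        "quarantine/magpie/2026-06-12.md",
        "vault/10-notes/injected.md",
        "quarantine/magpie/../../vault/sneaky.md",
    ]
    # Hypothetical run: the last two paths are reported as violations.
    print(find_violations(changed, "quarantine/magpie"))
```

Comparing path components instead of string prefixes means a sibling directory such as `quarantine/magpie-evil/` does not count as inside the quarantine.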

That’s the shared spine. Now the interesting part: given the same security model, the two birds do almost opposite things, and trying to make one bird do both jobs would have quietly ruined it.

The Magpie hoards what’s already shiny

A magpie collects shiny objects and keeps them close. Mine watches my own GitHub stars.

The premise is slow public signal × private context. I starred some repo three weeks ago, forgot about it, and moved on. Meanwhile my projects shifted. The Magpie runs weekly, pulls my starred repos through one allowlisted endpoint (gh api user/starred), and collides each one against what I’m actively building right now — the live projects, the open hubs.

Its output contract is a tight one: it is a relevance filter. It fires only when a star actually touches live work, and every finding has to name three concrete things — the repo, the project it connects to, and one “so what.” A vague “these are thematically related” doesn’t count as a hit. It’s a watchdog on the dials, not a newsletter.

The supervised proof run, over 28 stars, surfaced exactly two real hits and refused to invent a third:

supertonic (on-device multilingual TTS) × my Hungarian-audiobook voice-cloning project — a possible escape from a TTS fight I’d been losing. I checked: it genuinely supports Hungarian. That’s a hit with a so-what.
agentmemory × the exocortex itself — prior art for persistent AI memory, notably with benchmarks my own notes lacked. (And if you’ve read about the time I benchmarked my own search and it lost, you’ll know how much I needed that nudge.)

The other ~22 stars mapped to tidy thematic clusters and were correctly not reported. That restraint is the feature.

The Blue Jay scatters acorns and forgets where

Here’s the bird that explains why there are two.

Blue jays don’t hoard close like magpies. They cache acorns far and wide and forget where they buried some — and the forgotten ones grow into oak trees. Ecologists think blue jays are why oak forests spread north after the last ice age. Seed dispersal, by way of a bad memory. That is exactly the job I wanted for the second bird, and the metaphor was too good to pass up.

The Blue Jay reads an allowlist of eight RSS feeds, picked so tech and science cross-pollinate:

Tech: Hacker News (high-score front page), lobste.rs, Ars Technica
Science & ideas: phys.org, Quanta, Aeon, Nautilus
Wildcard: Medium — but scoped to specific tag feeds, never the raw firehose of crypto and self-help

Quanta, Aeon, and Nautilus are on that list on purpose: they’re the connective tissue, the feeds where “huh, that’s weirdly similar to…” happens before my vault even gets involved.

And its output contract is the opposite of the Magpie’s. The Blue Jay is a serendipity filter. Its job is to surface the connection that isn’t in my projects yet — the distant idea, the acorn worth burying. If I ran it through the Magpie’s “only fire on a live-work hit” rule, I would strangle the one thing it exists to do. Relevance and serendipity pull in opposite directions, and you can’t tune a single agent to maximize both.

One more load-bearing detail, half design and half security: the Blue Jay collides on the RSS summary only — title, abstract, link. It never pulls the full article body into context. That’s simultaneously the lower-injection path and the right cognitive shape (a headline is a seed; I click through myself from quarantine if the seed is interesting). The narrow input is doing double duty.
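The "summary only" rule can be sketched as a parser that keeps three fields and drops everything else. This is an illustrative Python sketch, not the Blue Jay's actual fetch step: `summaries_only`, the `max_chars` cap, and the sample feed are hypothetical. RSS feeds often carry the full body in `content:encoded`, which this sketch never reads. For genuinely untrusted feeds a hardened XML parser is worth considering, since the standard-library one is not designed for hostile input.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Oaks moved north with the jays</title>
      <description>A short abstract.</description>
      <link>https://example.org/oaks</link>
      <content:encoded>Full article body that never reaches the model.</content:encoded>
    </item>
  </channel>
</rss>"""


def summaries_only(rss_xml, max_chars=500):
    """Extract title, summary and link from each RSS <item>; ignore the rest."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        summary = (item.findtext("description") or "").strip()
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": summary[:max_chars],  # cap how much text enters context
            "link": (item.findtext("link") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in summaries_only(SAMPLE):
        print(entry)
```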

Why two birds and not one with a flag

I genuinely considered making this one agent with a --mode=relevance|serendipity switch. I’m glad I didn’t, and the reasoning generalizes past birds:

	Magpie	Blue Jay
Source	my GitHub stars (structured API)	8 RSS feeds (open prose)
Injection risk	low	the highest frontier
Fires when	a star hits live work	a summary sparks a distant idea
Output	relevance: repo → project → so-what	serendipity: the not-yet-relevant connection
Failure mode it guards against	noise / false relevance	being strangled into silence

Two things made the split non-negotiable. First, the output contracts are too different to share one brain — “only speak on a hit” and “speak about the thing that isn’t a hit yet” are contradictory prompts, and a single agent told to do both does neither well. Second, open news is a higher injection frontier than a structured stars API, so the riskier bird deserves its own enforced blast-radius wrapper, not a code path bolted onto the safe one. When two jobs disagree on both what good output is and how dangerous the input is, that’s not a flag. That’s two programs.

So now my vault has two more agents reading the world on a cron. The Magpie runs Saturday at 06:00 and tells me when something I bookmarked finally became relevant. The Blue Jay runs Saturday at 07:00 and buries acorns in a quarantine folder, most of which I’ll ignore — but I only need one of them to grow into an oak.

Both are on probation for their first few runs, because I don’t trust a thing that reads the internet until I’ve watched it behave. But the part I’m actually happy about isn’t the agents. It’s that building the second one forced me to say out loud what the first one was secretly assuming — and the names made the difference impossible to forget. A magpie hoards. A blue jay scatters. You want both, and you do not want them to be the same bird.

I Added a Knowledge Graph to My Search. It Made It Worse.

Fri, 05 Jun 2026 00:00:00 +0000

I have a note in my second brain that I wrote months ago. It says, with the confidence of someone who hadn’t measured anything:

Combining lexical search (BM25) with vector similarity and graph expansion produces more robust recall than embeddings alone.

That sentence shipped into production. My vault of markdown notes gets indexed into a search database, and the search function fuses three signals: BM25 (classic keyword ranking), vector similarity (embeddings), and graph expansion — when a note matches, pull in its linked neighbours too, on the theory that the thing you want is often next to the thing you typed.

It sounds right. Graphs are having a moment in RAG. “Add a knowledge graph to your retrieval” is the kind of thing you can put on a slide and nobody pushes back. I believed it enough to make graph expansion a first-class signal with a weight of 0.5 — equal footing with keyword matching.

This week I finally wrote a benchmark. The graph wasn’t helping. It was the single biggest thing hurting my search.

The setup

30 gold queries against the live vault (63 notes), borrowing the harness shape from an eval framework I’d been reading. Each query has a hand-labelled “correct” note. I measured recall@5 (did the right note land in the top 5?) and MRR (how high did it rank?), across three retrievers:

grep — naive substring term-count. The dumb floor.
bm25 — pure keyword ranking, FTS5’s BM25. The honest baseline.
live — my production hybrid (BM25 + vector + graph).
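For readers who haven't met the two metrics, here is a minimal sketch of recall@k and MRR, assuming one gold note per query as described above. The function names and the toy retriever in the usage lines are illustrative; the real harness may score differently.

```python
def recall_at_k(ranked, gold, k=5):
    """1.0 if the gold note appears in the top k results, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def reciprocal_rank(ranked, gold):
    """1/rank of the gold note (rank starts at 1), or 0.0 if it is missing."""
    for position, note in enumerate(ranked, start=1):
        if note == gold:
            return 1.0 / position
    return 0.0


def evaluate(retriever, gold_queries, k=5):
    """Average recall@k and MRR over (query, gold_note) pairs."""
    recalls, rranks = [], []
    for query, gold in gold_queries:
        ranked = retriever(query)
        recalls.append(recall_at_k(ranked, gold, k))
        rranks.append(reciprocal_rank(ranked, gold))
    n = len(gold_queries)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rranks) / n}


if __name__ == "__main__":
    # Toy retriever: a fixed ranking per query, standing in for grep/bm25/live.
    fake_index = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z", "w", "v", "gold"]}
    print(evaluate(lambda q: fake_index[q], [("q1", "b"), ("q2", "gold")]))
```

Running the same `evaluate` over each retriever with the same gold set is what makes the three rows of a scorecard comparable.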

I expected a clean staircase: grep at the bottom, bm25 in the middle, my clever hybrid on top. That’s the whole reason you build the clever thing.

The scorecard

retriever	recall@5	MRR
grep	0.467	0.307
bm25	0.950	0.826
live (hybrid, `w_graph=0.5`)	0.650	0.520

Read that bottom row twice. My production “smart” search found the right note 65% of the time. Plain keyword search found it 95% of the time. The hybrid I’d been quietly proud of was worse than its own baseline — it broke 9 of 30 queries that BM25 got right. BM25 alone whiffed on exactly one.

The clever layer wasn’t adding intelligence. It was adding noise, confidently.

Why the graph backfired

Here’s the mechanism, and it’s almost funny once you see it.

Graph expansion pulls in a matched note’s neighbours. But in a real knowledge base, the most connected notes are hubs — my inbox of ideas, my project radar, my “things Claude noticed” log. Everything links to them, so they’re everyone’s neighbour. When I searched for something specific, the graph helpfully dragged these popularity-contest winners into the candidate set, and they elbowed the genuinely relevant note clean out of the top 5.

Concrete example. Query: “who owns this knowledge system?” The correct answer is my personal note. BM25 ranked it #5 — just barely in. The hybrid, drunk on graph neighbours, pushed it off the list entirely. The graph didn’t find a better answer. It buried a good one under hubs.
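The crowd-out effect is easy to reproduce on toy data. The sketch below is not the production scoring formula: the additive fusion, the "share of matched notes linking here" bonus, and every score and note name are assumptions chosen to show how hubs can displace a rank-5 hit.

```python
# Toy BM25 scores for notes that matched the query; "about-me" is the gold note.
BM25 = {"n1": 0.9, "n2": 0.8, "n3": 0.7, "n4": 0.6, "about-me": 0.25}

# Outgoing links per note; "inbox" and "radar" play the role of hubs.
LINKS = {
    "n1": {"inbox", "radar"},
    "n2": {"inbox", "radar"},
    "n3": {"inbox"},
    "n4": {"inbox", "radar"},
    "about-me": set(),
}


def fuse(bm25, links, w_graph):
    """Rank notes by BM25 score plus a weighted graph-neighbour bonus.

    The bonus for a note is the share of matched notes that link to it,
    so heavily linked hubs collect the largest bonus.
    """
    matched = list(bm25)
    candidates = set(matched)
    for note in matched:
        candidates |= links.get(note, set())
    scores = {}
    for note in candidates:
        inbound = sum(1 for m in matched if note in links.get(m, set()))
        scores[note] = bm25.get(note, 0.0) + w_graph * inbound / len(matched)
    return sorted(scores, key=lambda n: (-scores[n], n))


if __name__ == "__main__":
    for weight in (0.0, 0.1, 0.5):
        print(weight, fuse(BM25, LINKS, weight)[:5])
```

With these toy numbers the gold note keeps fifth place at weights 0.0 and 0.1, and the `inbox` hub takes its slot at 0.5, even though the hub never matched the query text at all.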

I swept the graph weight to confirm it wasn’t a fluke. The trend was monotonic: every increment of graph weight lowered MRR, and recall@5 held only at the smallest weight before falling too:

graph weight	recall@5	MRR
0.0 (off)	0.950	0.826
0.1	0.950	0.737
0.25	0.817	0.564
0.5 (what I shipped)	0.650	0.520

There’s no ambiguity to argue with. More graph, more harm, no exceptions. The value I’d been claiming in that confident note — I finally measured it, and it was negative.

The fix, and the actual lesson

The fix was one line: drop the default graph weight from 0.5 to 0.1. Recall snapped back to 0.95, tying pure BM25. (Turning the graph fully off is better still on MRR, 0.826 versus 0.737; I kept a whisper of it as a tiebreaker, which is a taste call, not a data-driven one.)

But the one-line fix isn’t the point. The point is where graphs belong.

Graph expansion isn’t a bad idea — I aimed it at the wrong job. Precision retrieval (“find me the one note that answers this”) wants to be narrow and literal. Pulling in neighbours is the opposite of what you want; every neighbour is a chance to be wrong. But I have a different feature in this same system — a discovery mode that deliberately collides distant notes to surface unexpected connections. There, neighbour-pulling isn’t noise, it’s the entire product.

Same mechanism. One context it’s poison, the other it’s the point. I’d been running my discovery tool inside my lookup tool and calling it a hybrid.

A few honest caveats, because a benchmark you can’t poke holes in is usually lying: my gold set is self-authored v1, the corpus is small (63 notes), and the vector signal was actually dark during this run — I hadn’t built the embeddings yet, so “hybrid” here was really “BM25 + graph.” The vector half of my original claim is still untested. This is directional, not gospel.

But directional was enough. I’d shipped a claim, the claim got measured, and it didn’t survive contact with 30 queries. That’s the whole reason I keep my brain in git with everything reproducible: so the day I bother to measure, the measurement can actually win the argument against my own confident prose.

The slide-deck version of RAG says add a graph. The benchmark says know which question you’re answering first. I’ll take the benchmark.

VoteWatch: How Your Representatives Voted — and Whether You'd Agree

Fri, 15 May 2026 00:00:00 +0000

Open data nobody opens

Every vote in the European Parliament and the Slovak National Council is public. The EU even ships it as a clean API. And almost nobody reads it, because the raw record is unreadable: “Návrh poslanca… ktorým sa dopĺňa zákon č. 581/2004 Z. z. … (tlač 1259) — tretie čítanie, hlasovanie o návrhu zákona ako o celku.” Multiply that by a few hundred votes a sitting. Transparency that no human can parse is transparency on paper only.

So I built VoteWatch — a small site on my homelab that turns the record into something a citizen can actually use: what was decided, who voted, and do you agree?

VoteWatch SK: each decision summarised in plain language, which parties voted how, and a Yes/No question whose live citizen tally sits next to how parliament actually voted — labelled agree or gap.

Two halves, one lopsided

The EU half was easy. HowTheyVote.eu already did the hard work and publishes roll-call votes as a clean, open-licensed API. You consume it; you don’t scrape it.

The Slovak half is where the real work lives, and the real value. nrsr.sk has no API. The HTML is the contract: a results listing, and per-vote pages where each MP appears next to a one-letter code ([Z] za, for; [P] proti, against; [?] zdržal sa, abstained). So the national half is a genuine scraper, the unglamorous kind that nobody maintains, which is exactly why a gap exists to fill. The unglamorous part is the moat.

From ten votes to one question

A single bill generates a pile of procedural roll-calls — shorten the debate, move to third reading, amendment block A, amendment block B, the bill as a whole. Ten rows that are really one decision. Nobody wants ten rows.

So the pipeline groups votes by bill, then asks an LLM (llama-3.3-70b on NVIDIA NIM) to do exactly one job: turn the bureaucratic titles into a plain headline, two sentences of summary, and one neutral Yes/No question a person can actually answer. Seven votes on the health-insurer bill collapse into: “Changes to the health-insurance law” → “Do you agree with the health-insurance bill?”
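The grouping step can be sketched without any model involved. This assumes the grouping key is the parliamentary print number that shows up in titles as `(tlač N)`; the regex, the function name, and the `"unknown"` bucket are illustrative, not the pipeline's actual code.

```python
import re
from collections import defaultdict

# Matches the print number in titles such as "... (tlač 1259) ...".
PRINT_NUMBER = re.compile(r"\(tlač\s+(\d+)\)")


def group_by_bill(vote_titles):
    """Group roll-call titles by print number; unmatched titles go to 'unknown'."""
    groups = defaultdict(list)
    for title in vote_titles:
        match = PRINT_NUMBER.search(title)
        key = match.group(1) if match else "unknown"
        groups[key].append(title)
    return dict(groups)


if __name__ == "__main__":
    titles = [
        "Návrh zákona (tlač 1259) - skrátenie rozpravy",
        "Návrh zákona (tlač 1259) - tretie čítanie",
        "Návrh zákona (tlač 1301) - prvé čítanie",
        "Procedurálny návrh",
    ]
    for bill, rows in group_by_bill(titles).items():
        print(bill, len(rows))
```

Only the grouped titles would then be handed to the model for a headline, a summary and a question; which rows belong together is decided before it ever runs.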

The rule that keeps it honest

Here’s the line I won’t cross, and it’s the whole reason I trust the result: the AI writes the prose, but it never decides a fact.

Which votes belong to one bill? Deterministic — parsed from the bill number.
Did it pass? Deterministic — read from the result row.
Which parties voted for, against, abstained? Deterministic — tallied from the per-MP record, shown as Za: SMER-SD, HLAS-SD, SNS · Zdržali sa: PS, KDH, SaS.

The model only touches language: the headline, the summary, the question. If it hallucinates, you get an awkward sentence — never a wrong vote count. And if the model fails entirely, the card falls back to the raw title. The facts come from the record; the model just makes the record legible. For civic data, that separation isn’t a nice-to-have — it’s the difference between a tool and a liability. (Every card says so out loud: summaries are AI-generated; the raw record prevails.)
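The party line on each card can likewise be a plain tally. A minimal sketch, assuming each MP record has already been parsed into a `(party, code)` pair and that a party is shown under whichever stance most of its MPs took. The majority rule and the `"split"` label for ties are assumptions, not the site's documented behaviour.

```python
from collections import defaultdict

# One-letter codes from the per-vote pages; other codes are ignored here.
CODE_TO_STANCE = {"Z": "za", "P": "proti", "?": "zdržal sa"}


def party_stances(mp_votes):
    """Tally per-MP codes into one majority stance per party.

    mp_votes: iterable of (party, code) pairs.
    Ties are reported as 'split' instead of being guessed.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for party, code in mp_votes:
        stance = CODE_TO_STANCE.get(code)
        if stance:
            counts[party][stance] += 1
    result = {}
    for party, tally in counts.items():
        ranked = sorted(tally.items(), key=lambda kv: -kv[1])
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        result[party] = "split" if tied else ranked[0][0]
    return result


if __name__ == "__main__":
    votes = [("SNS", "Z"), ("SNS", "Z"), ("SNS", "P"), ("PS", "?"), ("KDH", "Z"), ("KDH", "P")]
    print(party_stances(votes))
```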

The part that closes the loop

Showing people how their representatives voted is only half a feedback loop. The other half is letting them answer.

Each decision carries its one distilled question and two buttons — Áno / Nie. You vote, and the site shows the citizen tally next to how parliament actually decided, with the honest verdict on top: "✓ Citizens and Parliament agree" or "⚖ Gap between citizens and Parliament." That gap is the entire point. It’s the thesis behind a side project of mine called veracracy — governance measured against verified knowledge and the actual will of the governed — made concrete enough to click.

The same loop on the European Parliament — dossiers consolidated, political-group stances (EPP, S&D, PfE…), and the citizen poll under each topic.

The backend is deliberately boring. The site is static (git-synced nginx, same as this blog). Votes can’t POST to a static page, so they go to a public n8n webhook that records to a data table and returns live tallies — no new service, no database, just the automation box I already run. Vote keys are namespaced so EU and Slovak polls share one store without colliding.

The honest caveat

Dedup is browser-local. It stops casual double-voting, but behind a Cloudflare tunnel every request shares one IP, so this is an indicative signal, not a secured ballot. That’s the right altitude for “let people express an opinion.” The day it needs to mean more than that, it needs real identity first — and I’d rather ship the honest version than fake the robust one.

It’s live at votewatch.hippotion.com — the EU parliament and the Slovak NR SR, every MEP and every poslanec, in plain language, with a button that asks the only question that matters after a vote: would you have voted the same way?

A neutral record — what was decided and who decided it — not a villain list. Data © HowTheyVote.eu (ODbL) and nrsr.sk.

Veracracy: The Question We Forget to Ask When We Govern

Fri, 08 May 2026 00:00:00 +0000

The missing question

Democracies ask what do we want? Markets ask what will we pay? Both are good questions, and between them they run most of the world. But there’s a third question that almost no system asks before it acts, and it’s the one that decides whether the first two produce anything good:

What do we actually know — and how do we know it?

I gave the idea of governing as if that question came first a name — veracracy, from veritas (truth) and kratos (rule). Not rule by experts in a back room. Rule by evidence that anyone can inspect, deliberated by the people it binds. It lives at veracracy.hippotion.com, and this is the honest account of what it is and why an infrastructure engineer ended up building a shrine to an idea.

The clock reads ~2051 — computed, not wished, from one published assumption. The sun’s height above the horizon is how far the weighted dials have risen; the tag tracks the beacon (Taiwan) against the world.

What it actually means

Strip the romance and veracracy is five fairly concrete commitments:

Truth as civic infrastructure. Verified, open evidence maintained like roads and water — with provenance, versioning, and repair crews. A public utility, not a content feed.
Radical transparency. Binding decisions carry their evidence trail by default. A law without its sources is a claim, not a law.
Decentralised trust. No ministry of truth. Verification is plural, adversarial, and bridging — many checkers, no single owner, consensus that has to span camps to count.
Ethical AI as auditor and advocate. Machines that trace claims, surface contradictions, and argue against the powerful reading of the data — never as oracle, always as instrument.
Participatory epistemocracy. Citizens not as voters once every four years, but as standing jurors of what is true enough to act on, where weight accrues to evidence rather than volume.

If you squint, none of that is a political program. It’s the same instinct I bring to a cluster — provenance, versioning, a reconciler, no single point of trust — pointed at the question of how a society decides what’s real. That’s the only way I know how to think, so that’s the lens I used.

Why a clock, and why it can run backward

Here’s where most “vision” projects lose me, and where I tried not to lose myself. A manifesto is cheap. Anyone can declare a better world and feel moral. So instead of a manifesto, the site is a measurement.

It shows a single year — first light, the year the first place on Earth might plausibly govern this way. Right now it reads around 2051. But that number isn’t a wish; it’s computed, from one assumption stated in plain sight: the infrastructure of verified governance has been built since the world went online in 1991, and continues at the average pace it has held since. A set of dials — open data, civic tech, verification at platform scale — each scored 0–1 from a named source, weighted, extrapolated. Change the assumption and the number moves: pace the same dials from Athens in 508 BC instead, and dawn lands near the year 3860. So the assumption is the lever, which is exactly why it’s published.
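One way to read "weighted, extrapolated" is a straight line from the start year through today's weighted dial average. The sketch below is that reading only, not the site's actual formula: the function name, the linear model, and the dial values are hypothetical.

```python
def first_light_year(dials, start_year, now_year):
    """Linearly extrapolate the year the weighted dial average reaches 1.0.

    dials: list of (score, weight) pairs with score in [0, 1].
    Assumes progress began at start_year and keeps its average pace.
    """
    total_weight = sum(weight for _, weight in dials)
    progress = sum(score * weight for score, weight in dials) / total_weight
    if progress <= 0:
        raise ValueError("no progress yet, so there is no pace to extrapolate")
    # If `progress` took (now - start) years, the full distance takes 1/progress as long.
    return start_year + (now_year - start_year) / progress


if __name__ == "__main__":
    dials = [(0.5, 1.0), (0.5, 1.0)]  # hypothetical scores and weights
    print(first_light_year(dials, 2000, 2010))
    print(first_light_year(dials, 1900, 2010))
```

Under this model, moving the start year earlier while holding today's scores fixed implies a slower average pace and therefore a later result, which is the sense in which the assumption is the lever.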

And the clock can run backward. There’s a Watch — a standing log of what moved the year — that deliberately files the evidence against: every transparency rollback, every deliberation experiment that failed, every force pushing dawn further out. Because a sunrise that only ever gets closer is a marketing widget. The honesty is the product. If I can’t show you the thing that would move the number the wrong way, you shouldn’t believe the number.

The first brick

Ideas this size are easy to admire and easy to never touch. So the rule I set myself is that veracracy has to cash out in things that actually run.

The first one shipped this week: VoteWatch. It covers every roll-call vote in the European Parliament and the Slovak National Council, scraped from the public record, distilled into plain language, showing which party voted which way, with a button that asks you whether you’d have voted the same. It’s the second and fifth pillars made clickable: binding decisions carrying their evidence trail, and citizens as standing jurors rather than spectators. The gap it surfaces, between how parliament voted and how the people who answered would have, is veracracy in miniature, on real data, today.

It’s small. It’s one person’s homelab. The voting is an indicative signal, not a secured ballot, and I say so on the page. But it’s the difference between a belief and a brick, and I would rather lay one honest brick than write a beautiful manifesto.

Where this comes from

I’ll be straight about the shape of this: it’s idealistic, it’s personal, and I don’t expect to see first light. I’m a solo operator in a small town who runs a rack of servers and thinks too much about how systems stay trustworthy when no one’s watching them. Veracracy is what happened when that instinct refused to stay inside the server room.

The version of this I can defend isn’t the dream — it’s the discipline around the dream. Publish your assumption. Let the clock run backward. Cash the idea out in something real. Credit your sources. Ship the honest version, not the robust-sounding one.

A measurement with one stated assumption — not a prophecy. The clock’s at veracracy.hippotion.com; disagree with a dial and you’ve understood the point.

I Run GitOps for My Brain

Fri, 01 May 2026 00:00:00 +0000

The pattern I didn’t know I had

This week an AI agent told me something about my own systems that I’d never noticed, and it was correct: I have one favorite architecture, and I’ve built it three times.

At work: git holds Terraform code → Terraform derives the S3 buckets. Nobody clicks around in the AWS console; the repo is the truth.
In the homelab: git holds Kubernetes manifests → ArgoCD derives the cluster. Every app on my rack is a folder in a repo.
In my second brain: a vault of markdown notes → an indexer derives the search database (SQLite FTS + a link graph) that my AI tools query.

Same shape everywhere: a plain-text source of truth in git, and a machine that builds the real thing from it. Master copy, derived state. I never decided this consciously — it’s just how my hands build things now.

GitOps isn’t the git part

Here’s the thing that the third copy got wrong, and it took me embarrassingly long to see because I teach this pattern at the infrastructure layer.

“Configuration in git” existed long before GitOps. What made GitOps an actual shift was the reconciler: ArgoCD doesn’t apply your manifests once and wish you luck. It watches, continuously. When the cluster drifts from the repo, you get an OutOfSync badge, and with selfHeal enabled it puts reality back where the repo says it should be. The loop is the product. Git is just where the loop points.

My vault had no loop. If I edited a note and forgot to rebuild the index, the search results my AI agents rely on were silently stale — no badge, no error, nothing. The only protection was a rule in the repo’s agent instructions: “if files and index disagree, the files win — run the indexer.”

A policy that agents must remember. In other words: I was running Kubernetes with a sticky note on the monitor that says please redeploy after editing the YAML. I would never accept that on my cluster. My brain ran on it for months.

The fix took an afternoon

Two pieces, both boring on purpose.

exo status — the OutOfSync badge. The indexer now stores a content hash per note; status re-hashes the vault and diffs:

{
  "status": "OutOfSync",
  "modified": ["vault/10-notes/interests-themes.md"],
  "new": [],
  "deleted": [],
  "repair": "exo index"
}

Exit code 0 when synced, 1 when not, so scripts and CI can ask the question too, much like argocd app diff, which exits 1 when it finds a difference.
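The diff behind that report can be sketched in a few lines. This is an illustrative Python version, not the real `exo` code: SHA-256 as the hash, the function names, and passing note text in a dict instead of reading files are all assumptions.

```python
import hashlib


def content_hash(text):
    """SHA-256 of a note's content, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_status(indexed_hashes, vault_notes):
    """Diff hashes stored at index time against the current vault.

    indexed_hashes: path -> hash recorded by the indexer
    vault_notes:    path -> current note text
    """
    current = {path: content_hash(text) for path, text in vault_notes.items()}
    modified = sorted(
        p for p in current if p in indexed_hashes and current[p] != indexed_hashes[p]
    )
    new = sorted(p for p in current if p not in indexed_hashes)
    deleted = sorted(p for p in indexed_hashes if p not in current)
    drifted = bool(modified or new or deleted)
    report = {
        "status": "OutOfSync" if drifted else "Synced",
        "modified": modified,
        "new": new,
        "deleted": deleted,
    }
    if drifted:
        report["repair"] = "exo index"
    return report


if __name__ == "__main__":
    index = {"a.md": content_hash("one"), "b.md": content_hash("two")}
    print(sync_status(index, {"a.md": "ONE", "c.md": "three"}))
```

A command-line wrapper would then exit with 0 when the status is `Synced` and 1 otherwise.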

Git hooks — the selfHeal. Versioned hooks (core.hooksPath .githooks) on post-commit and post-merge rebuild the index after every commit and pull:

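# Skip quietly on machines where exo isn't installed.
command -v exo >/dev/null 2>&1 || exit 0
# Export the repo root so the exo child process can actually see it.
EXO_ROOT="$(git rev-parse --show-toplevel)"
export EXO_ROOT
exo index >/dev/null 2>&1 && echo "exo: index reconciled (Synced)"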

Now every git commit in the vault prints exo: index reconciled (Synced) on its way out. The rule didn’t change — files win — but it stopped being something agents must remember and became something a machine enforces. That’s the entire difference between configuration management and GitOps, replayed at the knowledge layer.

The part where it gets a little strange

The reason I’m writing this post at all: I didn’t have this idea. A scheduled agent did, on what I can only describe as an idle walk.

My vault has a weekly cron job — we call it the Wanderer — that samples pairs of notes that are far apart: different folders, different months, almost no shared vocabulary. A headless Claude gets the pairs with exactly one task: read both notes in full and say whether anything genuinely connects. “Nothing connects” is a successful run. That last sentence is load-bearing — the run always reports its result either way, so the agent never needs to manufacture a finding to have done its job.
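"Far apart" can be approximated cheaply. The sketch below is a hypothetical reading of the sampling step: it treats the top-level folder as the folder signal, uses Jaccard overlap of word sets as the vocabulary signal, and ignores dates entirely. The names and the 0.1 threshold are illustrative.

```python
import re
from itertools import combinations


def vocabulary(text):
    """Lower-cased set of ASCII words longer than three characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def distant_pairs(notes, max_overlap=0.1):
    """Return note pairs from different folders with little shared vocabulary.

    notes: path -> text. Overlap is the Jaccard similarity of the two word sets.
    """
    pairs = []
    for (path_a, text_a), (path_b, text_b) in combinations(sorted(notes.items()), 2):
        if path_a.split("/")[0] == path_b.split("/")[0]:
            continue  # same top-level folder: not distant enough
        vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
        union = vocab_a | vocab_b
        overlap = len(vocab_a & vocab_b) / len(union) if union else 0.0
        if overlap <= max_overlap:
            pairs.append((path_a, path_b))
    return pairs


if __name__ == "__main__":
    notes = {
        "work/terraform.md": "terraform derives buckets from code in the repository",
        "brain/architecture.md": "markdown vault feeds an indexer that builds search",
    }
    print(distant_pairs(notes))
```

A weekly run could draw a small random sample from the returned pairs and hand each pair to the agent; an empty list simply means there is nothing to collide that week.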

On its very first walk, it collided a work note about Terraform-driven S3 provisioning with the architecture map of the vault itself, and wrote: same sentence in different clothes — and the brain copy is missing its reconciler. Then it listed the two fixes you just read about.

Retrieval answers the questions you ask. Distant collisions surface the questions you didn’t know you had. It turns out my second brain didn’t need to get better at remembering — it needed to occasionally interrupt me.

If you keep a vault

Whatever your stack — Obsidian, org-mode, a folder of markdown — if anything derives from your notes (an index, embeddings, a published site), then you have source of truth and derived state, and the GitOps question applies: who notices when they drift? If the answer is “I do, hopefully,” you’re running the sticky-note era. Give it a badge and a loop. It’s an afternoon.

🚩 I Built a Usage Dashboard and Tripped Claude Fable 5's Safety Net

Fri, 24 Apr 2026 00:00:00 +0000

The thing I was actually building

I wanted a small web page on my homelab that shows my Claude usage — the 5-hour session window, the weekly limits, the per-model split. There’s a nice Electron widget out there that does this on the desktop, but I don’t want a desktop app; I want a URL behind my own OAuth that I can glance at from my phone.

The mechanics are unremarkable. The claude.ai web app reads those numbers from a couple of undocumented endpoints using your logged-in session cookie. So a self-hosted version does the same thing server-side: hold the session token as a secret, replay the same calls, cache the result, render some bars. An afternoon’s work. I was pairing with Claude Fable 5 on it — Anthropic’s newest model, and the one that ships with extra safety measures around dual-use capability.

Then, partway through, I got the message: Fable 5 flagged something in this session and switched to a more conservative model. It dropped me to Opus 4.8 for the rest of the conversation. Safe conversations sometimes trip it, the notice said. Send feedback.

I wasn’t doing anything wrong. That’s the interesting part.

My first reaction was the obvious one — what did I say? But I knew exactly what I’d built, and none of it was sketchy. It was my account, my usage data, my hardware, my OAuth in front of it.

So I went looking at the request the way a classifier would — not “what did he mean” but “what does this look like.” And from that angle it’s a different picture entirely. Stack up the surface features:

🔑 capturing a session token and storing it to replay later
🌐 sending it to an undocumented API that isn’t meant for third parties
🕵️ spoofing a browser User-Agent so the request blends in
🧱 detecting and working around a Cloudflare bot challenge

Read that list cold, with no context. That’s not a usage dashboard. That’s the exact signature of credential theft and scraping tooling. Every individual move is one a malicious script would also make. The only thing separating my afternoon project from the bad version is whose account it touches and why — and intent is precisely the part that doesn’t show up in the tokens.

Surface vs. intent

This is the part worth sitting with, because it’s not a Claude quirk — it’s the shape of every content classifier, every WAF rule, every fraud model I’ve ever run in production.

A detector scores what it can see. It cannot see intent; it sees features. And the features of “monitor my own usage” and “harvest someone else’s session” overlap almost completely, because the technique is identical — the difference lives entirely in context the model has been deliberately built not to over-trust. You can’t tune that gap away. You can only pick where to sit on the precision/recall curve, and Fable 5 — being the high-capability model with the extra dual-use measures bolted on — sits where it catches the pattern even when it costs some false positives, then hands off to Opus 4.8. I was the false positive. The system did roughly the right thing for roughly the right reason; it just doesn’t feel that way when it’s pointed at you.

The honest engineering takeaway is the one I keep relearning: if a benign task has the silhouette of an abusive one, expect to get treated like the silhouette. Not just by AI — by rate limiters, by bot detection, by the fraud team. The fix isn’t to be offended. It’s to recognize the silhouette, and where it matters, make the legitimate context legible up front.

What I’d do differently

Practically, very little — the project was fine, and it downshifted to a model that finished the job. But the framing changed how I built it. I leaned harder into the parts that make intent visible in the design: the session token never leaves the server, it lives in Vault and arrives as an injected secret, the whole thing sits behind OAuth, and it polls on a leash instead of hammering. Not because a classifier made me, but because those are the same choices that make it obviously a personal dashboard and not a harvesting bot — to a reviewer, to future-me, and yes, to a model reading over my shoulder.

The widget rides your credential on your desktop. Mine keeps it server-side behind my own front door. Turns out building it the trustworthy way and building it the legibly trustworthy way are the same work — and getting flagged is what made me notice the difference.

🎙️ Cloning My Own Voice for My Kid's Audiobooks

Fri, 13 Mar 2026 00:00:00 +0000

The problem nobody sells a fix for

My kid loves audiobooks. The commercial platforms barely carry Hungarian children’s books, and none of them carry the one narrator my kid actually prefers: me. I can’t read aloud every evening — but my homelab doesn’t have that excuse.

The platform half (ebook → M4B → Audiobookshelf on k3s) is a story for another post. This one is about the voice: how to go from a phone recording to an audiobook narrated in your own voice, step by step, on hardware with no GPU.

The short version: XTTS-v2 does zero-shot voice cloning from a ~20-second sample. No training, no fine-tuning, no dataset. One clean recording and a flag.

Why XTTS-v2, in 2026?

It’s not the best open TTS model anymore. Chatterbox beats ElevenLabs in blind tests; F5-TTS sounds cleaner. But model selection for a small language is constraint-first, not leaderboard-first: Chatterbox has no Hungarian, NVIDIA’s TTS NIMs have no Hungarian, Kokoro — no Hungarian. XTTS-v2 speaks Hungarian and clones voices and runs on CPU. That intersection has exactly one resident.

I run it via ebook2audiobook, which wraps XTTS with Calibre ingestion and M4B chaptering.

Step 1 — Record ~25 seconds of yourself

Phone voice-memo app, quiet room, ~20 cm from your mouth. Mine came out as 28 seconds of stereo 48 kHz AAC. Two rules that matter more than gear:

Read the way you want the books narrated. The clone copies prosody — energy, pacing, warmth — not just timbre. A flat recital clones into a flat narrator. I read a children’s tale the way I’d read it at bedtime.
Don’t peak the mic. My sample hit −0.1 dB max volume — right at the clipping ceiling. It worked, but quieter is safer. Check yours:

ffmpeg -i janos.m4a -af volumedetect -f null - 2>&1 | grep volume
# mean_volume: -21.4 dB   ← fine
# max_volume:  -0.1 dB    ← living dangerously
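
If you want the same check in a script, a minimal sketch could parse the volumedetect lines out of ffmpeg’s stderr. The 1 dB headroom is my own assumption, not a rule from XTTS or ffmpeg, and the sample log only imitates the shape of ffmpeg’s output.

```python
import re

def parse_volumedetect(ffmpeg_stderr):
    """Extract mean_volume and max_volume (in dB) from a volumedetect log."""
    pattern = r"(mean_volume|max_volume):\s*(-?\d+(?:\.\d+)?) dB"
    return {name: float(value) for name, value in re.findall(pattern, ffmpeg_stderr)}

def is_near_clipping(levels, headroom_db=1.0):
    """True when the peak sits within headroom_db of 0 dBFS (assumed margin)."""
    return levels["max_volume"] > -headroom_db

# Sample text shaped like ffmpeg's volumedetect output.
sample_log = (
    "[Parsed_volumedetect_0 @ 0x5581] mean_volume: -21.4 dB\n"
    "[Parsed_volumedetect_0 @ 0x5581] max_volume: -0.1 dB\n"
)
levels = parse_volumedetect(sample_log)
print(levels, is_near_clipping(levels))
```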

Step 2 — Normalize to what XTTS wants

XTTS expects a mono WAV; 24 kHz matches its internal rate. Trim the silence off both ends while you’re at it:

ffmpeg -i janos.m4a \
  -af "silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
areverse,silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
areverse" \
  -ar 24000 -ac 1 janos.wav

(The double-areverse is the classic trick: configured with start_periods, silenceremove only trims leading silence, so you flip the audio, trim the front again, and flip it back.)
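
The flip-trim-flip idea is easy to see on a plain list of samples. This is a simplified illustration, not what ffmpeg does internally: the real filter works with dB thresholds and minimum silence durations, while this sketch uses a bare amplitude threshold.

```python
def trim_front(samples, threshold):
    """Drop leading samples whose absolute value is below threshold."""
    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            return samples[i:]
    return []

def trim_both_ends(samples, threshold):
    """Trim the front, reverse, trim the new front, reverse back."""
    front_trimmed = trim_front(samples, threshold)
    back_trimmed = trim_front(front_trimmed[::-1], threshold)
    return back_trimmed[::-1]

audio = [0.0, 0.001, 0.4, -0.6, 0.0, 0.3, 0.002, 0.0]
print(trim_both_ends(audio, threshold=0.01))  # [0.4, -0.6, 0.0, 0.3]
```

Note that the silent sample in the middle survives: only the two ends are touched.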

Drop the result where your TTS stack looks for voices. In ebook2audiobook that’s the voices/ tree, organised by language:

voices/hun/adult/male/janos.wav

Step 3 — Synthesize

One flag does the cloning. Headless run on the k3s pod:

kubectl exec -n web-audiobooks deploy/ebook2audiobook -- sh -c \
  'cd /app && python app.py --headless \
     --ebook "/app/ebooks/tale.txt" \
     --language hun \
     --tts_engine xtts \
     --device cpu \
     --voice /app/voices/hun/adult/male/janos.wav \
     --output_format m4b \
     --output_dir /app/audiobooks'

On my 12-core CPU node this runs roughly 4× slower than real-time: a 2-minute tale takes ~8 minutes, and a full children’s book is an overnight job. The first run computes speaker latents from your WAV; after that it’s ordinary synthesis with your voice as the reference.
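
For planning a batch, the arithmetic is worth writing down. This sketch derives the slowdown from the one data point above (a 2-minute tale in ~8 minutes) and assumes it scales linearly; the 2-hour book is a hypothetical example.

```python
def slowdown_factor(audio_minutes, render_minutes):
    """How many minutes of compute one minute of finished audio costs."""
    return render_minutes / audio_minutes

def estimate_render_hours(book_audio_hours, factor):
    """Projected wall-clock hours, assuming the factor holds for long runs."""
    return book_audio_hours * factor

factor = slowdown_factor(audio_minutes=2, render_minutes=8)
print(factor)                                                    # 4.0
print(estimate_render_hours(book_audio_hours=2, factor=factor))  # 8.0
```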

Step 4 — A/B before you batch

Render one short book twice, once with the stock narrator and once with the cloned voice, and put both in front of the household jury. Cloning quality is personal in the most literal sense: a mean opinion score (MOS) won’t tell you whether it sounds like you. My benchmark has strong opinions and goes to bed at eight.

Only after the clone passes do you re-render the library with --voice.

The manual steps that earn the word “manual”

Things the tutorials skip, learned the slow way:

Long conversions die with the browser tab. Gradio-style web UIs tie the job to the open page; close the laptop and you get “Conversion cancelled” half a book in. Anything longer than ~15 minutes of audio runs headless under nohup.
CPU synthesis leaks memory over hours. My pod has a hard 6 Gi limit on a 16 Gi node, and a 6-hour run will hit it. Keep the cap (it protects the other 30 namespaces), and rely on the tool’s --session resume — it picks up at the exact sentence. One catch: headless resume still asks an interactive Resume? [y]es — pipe echo y | into it.
The per-chapter FLACs survive a crash. If the final M4B muxing step OOMs, don’t re-synthesize: the chapters are sitting in the session’s tmp directory, and ffmpeg will assemble them into a chaptered M4B with a hand-written FFMETADATA file in about two minutes, at near-zero memory.

None of this is hard. It’s just undocumented — which is the gap between “there’s a model for that” and your kid pressing play.

Postscript: the jury came back

The clone failed. Recognizably my timbre, nowhere near natural — I wouldn’t play it to my kid, which is the only metric that exists for this project.

Worth being precise about what failed: the stock XTTS-v2 narrator passed the ear test and the library keeps growing with it. Zero-shot cloning is the part that fell short — a 2023 model conditioning on 26 seconds of a voice it has never seen, in a language that was never its strong suit. The pipeline above is still the right pipeline; the model isn’t there yet on CPU-class options.

The next experiment is already picked: F5-TTS Hungarian, a 2026 fine-tune on 280 hours of actual Hungarian speech, built precisely for short-sample cloning. It needs CUDA, which my node doesn’t have — but a rented spot GPU tests it for the price of an espresso. If it passes the bedtime jury, that’ll be its own post.

Negative results are results. The jury reconvenes when the GPU shows up.

🌱 My Second Brain Weeds Itself Now

Fri, 27 Feb 2026 00:00:00 +0000

A few weeks ago I rebuilt my second brain as a folder of markdown in git — vault is the source of truth, everything else (search index, graph, 3D viewer) is a derived layer I can delete and rebuild. I love it. But a knowledge base has a dirty secret: it rots.

Not the files — those are fine. The connections rot. You capture a note at 11pm and never link it to anything, so it becomes an orphan floating off the graph. A project note’s one-line summary describes what the project was three weeks ago. Two notes are obviously about the same thing and neither knows the other exists. Do this for a few months and you don’t have a second brain, you have a junk drawer with good search.

The honest fix is to weed the garden regularly. The honest truth is that nobody does, including me.

So I stopped relying on myself and built a gardener.

What it actually does

Every night at 3am, on my homelab box, a script runs:

Detect — exo garden, a plain query over the index, produces a report: here are the orphans, here are notes that should probably link to each other, here are summaries that look stale. No AI in this step. It’s SQL and graph traversal. Deterministic, boring, trustworthy.
Decide and write — that report gets piped to claude -p (Claude Code in headless mode). Claude reads the vault’s operating contract, makes only high-confidence edits — add a [[wikilink]] between two genuinely related notes, refresh a stale summary — caps itself at ~10 notes a night, and writes a dated log note explaining exactly what it changed and what it deliberately skipped.
Commit — the wrapper reindexes and lands everything as a single garden: 2026-06-09 … git commit, then pushes. My 3D graph viewer picks it up on the next sync.
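
The Detect step really is that boring. My version is SQL over the index; as a stand-in, here is a plain-Python sketch of orphan detection over a dict of notes. The definition of an orphan used here (no inbound and no outbound links to existing notes) is an assumption for the example.

```python
import re

# Capture the target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def find_orphans(notes):
    """notes maps note name -> markdown body. Returns names of notes with
    no inbound and no outbound wikilinks to other notes in the vault."""
    linked = set()
    for name, body in notes.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target in notes and target != name:
                linked.update((name, target))
    return sorted(set(notes) - linked)

vault = {
    "map": "See [[homelab|the homelab note]].",
    "homelab": "",
    "late-night-idea": "Links only to [[a-note-that-does-not-exist]].",
}
print(find_orphans(vault))  # ['late-night-idea']
```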

The first real run, it found one orphan (90-meta/README), linked it into the notes it actually indexes, and then — this is the part I liked — declined to touch the 12 “stale summary” candidates because, on inspection, every one of them was already accurate. It wrote: “flagged by length, not staleness; churning them would add noise.” A gardener that knows when not to prune is the one you can leave alone.

“Isn’t this a solved problem?”

Mostly, no — but partly, yes, and I want to be straight about it. AI-assisted note-linking exists: Obsidian plugins like Smart Connections suggest related notes, and apps like Mem and Reflect auto-organize as you write. They’re good.

Three things make this different enough to build:

Every change is a reviewable git diff, authored by a named agent. Not silent magic that rearranges your notes while you’re not looking. git log -p shows you exactly what the gardener did last night; git revert undoes a bad night in one command. For something as personal as a knowledge base, “show me the diff” beats “trust me.”
It’s mine, end to end. Runs on my hardware, on my schedule, with a model I point at. No SaaS holds my brain hostage.
The detection is deterministic; the model only acts. The LLM never decides what’s wrong — a boring query does that. The model only decides how to fix the things already found. That split keeps the whole thing auditable and cheap.

If you already live in a tool that does this and you trust it, great. I wanted the git-diff trail and the local control.

The part I actually want to tell you about

The plan was tidy: I run n8n on the same cluster, so n8n would be the scheduler — fire nightly, SSH into the node, run the gardener. Clean, visual, one workflow.

n8n could not reach the node. At all. Every port: ECONNREFUSED.

This sent me down a genuinely interesting hole, because the homelab runs Cilium for networking, and Cilium has opinions about your own node that plain Kubernetes does not.

First instinct: a NetworkPolicy allowing egress to the node’s IP. Wrote it, synced it, still refused. The reason is a Cilium subtlety worth knowing: the node isn’t a CIDR, it’s an identity. Cilium classifies your cluster’s own node as the special host identity, and ordinary ipBlock CIDR rules do not match it unless you flip a cluster-wide setting (policy-cidr-match-mode: nodes). My 192.168.0.109/32 rule was a no-op.

So I switched to the Cilium-native tool: a CiliumNetworkPolicy with toEntities: [host]. Confirmed it applied — I could see reserved:host allowed right there in the datapath’s BPF policy map. I confirmed the node’s IP really does resolve to identity 1 (host). I confirmed the host firewall was disabled. Everything said “allowed.”

Still ECONNREFUSED.

That’s the wall. The packet leaves the pod with Cilium’s blessing, hits the host’s own network stack, and something there sends a reset — and I couldn’t see what, because inspecting the host firewall needs root, and this automation deliberately doesn’t have it. I could have kept digging with a password. But I stopped and asked a better question: why am I making a pod reach back into the host it’s running on at all?

That’s an awkward direction. The work has to happen on the host (that’s where the vault, git creds, and Claude live). A pod straining to SSH into its own node is fighting the grain of the platform.

So I inverted it. The node schedules itself — a plain cron entry, rock-solid, no network gymnastics. And n8n, instead of triggering the job, receives it: at the end of each run the node POSTs a summary to an n8n webhook. Node→n8n works perfectly (it’s just an outbound HTTPS call to a URL). n8n keeps the run history and is the place I’ll later wire a phone notification.
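
The reporting half is a single outbound POST. A minimal sketch with the standard library follows; the webhook URL and the summary fields are hypothetical placeholders, not my real ones.

```python
import json
import urllib.request

def build_summary_request(webhook_url, summary):
    """Build (but do not send) the POST that reports a run to n8n."""
    body = json.dumps(summary).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_summary_request(
    "https://n8n.example.invalid/webhook/garden-run",  # hypothetical URL
    {"date": "2026-06-09", "notes_changed": 1, "status": "ok"},
)
# Sending is one more call; it is commented out so the sketch runs offline:
# urllib.request.urlopen(req, timeout=10)
```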

I lost nothing that mattered. n8n is still my dashboard; the schedule just lives where the work lives. And I deleted the SSH key and the network-policy hole I’d opened — the cleanup felt better than the original plan would have.

The lesson, such as it is

Two, actually.

One: when you’re automating something to run unattended, the bug you want to find is the one that shows up in a dry run at 2pm, not at 3am three weeks from now. I almost shipped a version where a brand-new note (untracked by git) was invisible to my change-detection and would’ve been silently wiped each night. The dry run caught it. Always build the dry run.
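
One plausible shape of that bug, as a sketch rather than my actual wrapper: a diff-based check only sees files git already tracks, while `git status --porcelain` also lists untracked files, marked `??`. Parsing it keeps brand-new notes visible. The paths are hypothetical.

```python
def changed_paths(porcelain_output):
    """Split `git status --porcelain` (v1) output into tracked changes
    and untracked files. Each line is two status chars, a space, a path."""
    tracked, untracked = [], []
    for line in porcelain_output.splitlines():
        if not line:
            continue
        status, path = line[:2], line[3:]
        (untracked if status == "??" else tracked).append(path)
    return tracked, untracked

sample = " M 10-projects/homelab.md\n?? 00-inbox/new-idea.md\n"
tracked, untracked = changed_paths(sample)
print(tracked, untracked)
```

Renames and quoted paths need more care than this; the sketch covers only the simple case.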

Two, the bigger one: I spent an hour trying to make a pod punch into its host because that was my plan, and the platform kept saying no in increasingly specific ways. The fix wasn’t a cleverer NetworkPolicy. It was noticing I was pushing against the design and turning around. The node scheduling itself and reporting up to n8n is simpler, safer, and more honest about where the work actually lives.

My brain weeds itself now. Every morning there’s maybe one small, sensible commit waiting — a link I’d have never made, a summary nudged back to true — and I can read exactly what changed before my coffee’s done. That’s the whole dream of a second brain that isn’t a junk drawer: it stays a garden, and I barely have to touch it.

🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n

Fri, 13 Feb 2026 00:00:00 +0000

You don’t have to be about to change jobs to want to know the landscape. What’s being built, what it pays, where you’d actually fit — staying current on the market (and your own worth) is just good professional hygiene. The trouble is that checking is tedious, so most of us don’t, until we’re already job-hunting and starting cold.

So I automated mine. An n8n workflow on my homelab polls job boards every six hours, scores each new posting against my profile with an LLM, and emails me only the strong matches — the ones scoring 80%+. When it’s quiet, it’s silent. When something genuinely fits, I know the same day. Here’s what I learned building it. Repo at the bottom.

Three APIs cover most of the market

Company career pages look bespoke, but underneath, the vast majority run on one of three ATS — and all three hand you the jobs as unauthenticated JSON:

Greenhouse — boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true
Lever — api.lever.co/v0/postings/{token}?mode=json
Ashby — api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true

No scraping, no headless browser. You poll the API the page itself calls, normalize the three shapes into one { company, title, location, remote, url, posted_at, description, external_id }, and you’re done with the hard part.
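
The normalizer is three small mapping functions. This is a sketch in Python rather than the workflow’s own nodes, and the source field names (absolute_url, hostedUrl, jobUrl and so on) reflect my understanding of the public payloads; treat them as assumptions and check them against a live response. The external_id format is the namespaced one used for dedup.

```python
from datetime import datetime, timezone

def _job(ats, company, job_id, title, location, remote, url, posted_at, description):
    """Build the one normalized shape used downstream."""
    return {
        "company": company,
        "title": title,
        "location": location,
        "remote": remote,
        "url": url,
        "posted_at": posted_at,
        "description": description,
        "external_id": f"{ats}:{company}:{job_id}",
    }

def from_greenhouse(company, job):
    location = (job.get("location") or {}).get("name") or ""
    return _job("greenhouse", company, job["id"], job["title"], location,
                "remote" in location.lower(), job["absolute_url"],
                job.get("updated_at"), job.get("content") or "")

def from_lever(company, job):
    location = (job.get("categories") or {}).get("location") or ""
    created_ms = job.get("createdAt")  # assumed to be epoch milliseconds
    posted_at = None
    if created_ms is not None:
        posted_at = datetime.fromtimestamp(created_ms / 1000, tz=timezone.utc).isoformat()
    remote = job.get("workplaceType") == "remote" or "remote" in location.lower()
    return _job("lever", company, job["id"], job["text"], location, remote,
                job["hostedUrl"], posted_at, job.get("descriptionPlain") or "")

def from_ashby(company, job):
    return _job("ashby", company, job["id"], job["title"], job.get("location") or "",
                bool(job.get("isRemote")), job["jobUrl"],
                job.get("publishedAt"), job.get("descriptionPlain") or "")
```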

“Resolve the token” is half the battle

The naive assumption — the token is the company name, and everyone’s on one of the three — is half right. When I probed my initial wishlist, roughly half 404’d everywhere: HashiCorp (now under IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They’re on a fourth or fifth system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as disabled config stubs. Verify before you trust a slug.

Dedup without a database

I didn’t want to stand up Postgres just to remember which jobs I’d already seen. n8n’s Data Tables handle it natively: a seen_jobs table, an external_id namespaced {ats}:{company}:{id}, and the rowNotExists operation drops anything already recorded. State lives inside n8n, backed up with it. Zero extra infrastructure.

The ordering matters: notify first, mark seen second. The insert only happens after the email sends, so a failed send retries next run instead of silently swallowing a posting.
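
The same ordering in a few lines of Python, as a sketch: a set stands in for the seen_jobs Data Table, and send_email is a hypothetical callable that raises when the send fails.

```python
def external_id(ats, company, job_id):
    """Namespaced key, so ids from different boards cannot collide."""
    return f"{ats}:{company}:{job_id}"

def process(jobs, seen, send_email):
    """Notify first, mark seen second. Returns the number of new jobs sent."""
    new_jobs = [job for job in jobs if job["external_id"] not in seen]
    if not new_jobs:
        return 0
    send_email(new_jobs)  # may raise; nothing has been recorded yet
    seen.update(job["external_id"] for job in new_jobs)
    return len(new_jobs)

seen = set()
jobs = [{"external_id": external_id("lever", "acme", 1)}]

def failing_send(batch):
    raise RuntimeError("SMTP down")

try:
    process(jobs, seen, failing_send)
except RuntimeError:
    pass
print(seen)  # still empty, so the posting is retried on the next run
```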

The location filter is a trap

My first version kept everything that wasn’t explicitly US-based. The inbox filled with “Senior Platform Engineer — Spain (Remote)” and "… — United Kingdom (Remote)". Those aren’t remote-for-me — they’re remote if you live in Spain. Useless from where I sit.

The fix was to invert the logic. Keep only three things:

globally-remote / worldwide / anywhere,
pan-EU (EMEA / Europe / EU / EEA),
my own country.

…and drop single-country remote, even EU ones. Region and home matches win over the country deny-list, and ambiguous locations are kept (a missed match is worse than one extra line to skim). That one change cut the noise more than anything else.
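
Spelled out as code, the precedence looks roughly like this. It is a sketch, not the workflow’s actual filter: the term lists are abbreviated, and the home country and deny-list are parameters filled with placeholder values.

```python
import re

GLOBAL_TERMS = ("worldwide", "anywhere", "global", "globally")
REGION_TERMS = ("emea", "europe", "eu", "eea")

def _mentions(text, terms):
    """True if any term appears in text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)

def keep_location(location, home_country, other_countries):
    text = location.lower()
    # 1. Global and pan-regional matches win over everything else.
    if _mentions(text, GLOBAL_TERMS) or _mentions(text, REGION_TERMS):
        return True
    # 2. So does the home country.
    if _mentions(text, (home_country.lower(),)):
        return True
    # 3. Single-country remote elsewhere is dropped.
    if _mentions(text, tuple(c.lower() for c in other_countries)):
        return False
    # 4. Anything ambiguous is kept.
    return True

deny = ["Spain", "United Kingdom", "United States"]  # placeholder deny-list
for loc in ("Spain (Remote)", "Remote, EMEA", "Remote - Spain or EMEA", "Remote"):
    print(loc, keep_location(loc, home_country="Hungary", other_countries=deny))
```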

Let an LLM read the actual job

Keyword + location filtering gets you a candidate list, but it can’t tell a “Platform Engineer” who herds Kubernetes from a “Platform Engineer” who owns a Figma design system. The job description can.

So the last step scores each new posting against my CV. My first version batched all of them into one big LLM call, which promptly timed out on the free tier. The fix was the opposite: one small call per job, which also means a single slow or rate-limited job never sinks the batch. Each call asks an NVIDIA NIM model (Llama 3.1 8B, OpenAI-compatible) for one number and a reason:

Score this job 0–100 for fit against my profile. Return {score, reason}.

That score is what lets me widen the net instead of narrowing it. On top of the curated company list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location filters do the first pass, then I only email the roles scoring 80%+. Casting wide is fine when a model is the bar at the door. A line ends up looking like:

92% — Grafana Labs — Senior Platform Engineer (Remote, EMEA) — strong k8s/GitOps overlap — link

Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.
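
A sketch of that gate, assuming the model replies with a JSON object holding score and reason. Here ask_model is a hypothetical stand-in for the per-job LLM call; any reply that fails to parse, and any call that raises, just means that job is skipped.

```python
import json

def parse_score(raw):
    """Parse a {score, reason} JSON reply; return None if it is unusable."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return None
    if not 0 <= score <= 100:
        return None
    return score, str(data.get("reason", ""))

def gate(jobs, ask_model, threshold=80):
    """Return (job, score, reason) for every job scoring at or above threshold."""
    matches = []
    for job in jobs:
        try:
            parsed = parse_score(ask_model(job))
        except Exception:
            parsed = None  # one failed call must not sink the run
        if parsed and parsed[0] >= threshold:
            matches.append((job, *parsed))
    return matches

replies = {
    "a": '{"score": 92, "reason": "strong k8s/GitOps overlap"}',
    "b": "not json at all",
    "c": '{"score": 40, "reason": "design-system role"}',
}

def fake_model(job):
    if job["id"] == "d":
        raise TimeoutError("rate limited")
    return replies[job["id"]]

matches = gate([{"id": key} for key in "abcd"], fake_model)
print(matches)
```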

The unglamorous bits that make it trustworthy

One bad source can’t kill the run — every fetch is wrapped; failures become a ⚠️ N sources failing footer so a company quietly changing ATS is visible, not invisible.
A prime run seeds the table silently the first time, so I’m not buried under every currently-open role on day one.
Everything tunable lives in one Config node — companies, keywords, location lists, the profile, the model — so adding a company is a one-line edit, not a graph safari.

Takeaways

The “scrape job boards” problem mostly isn’t a scraping problem — it’s three public APIs and a normalizer.
For personal automation, reach for the boring-but-correct primitive: native dedup state beats a database you have to operate.
An LLM works best here as the bar at the door: cheap deterministic filters keep the candidate set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net without drowning in it.

Workflow JSON, the full node-by-node breakdown, and setup notes: github.com/janos-gyorgy/ats-job-poller.

🧠 A Second Brain You Can `git clone`

Fri, 16 Jan 2026 00:00:00 +0000

The graveyard of second brains

I had a second brain once. Obsidian vault, a CouchDB LiveSync backend, even a weekly agent that summarised my notes. It worked — for a while. Then the sync started fighting itself across my laptop, the homelab, and my phone, and the day syncing becomes a chore is the day you stop opening the thing. The notes were still there. I just never looked at them again.

That’s how most second brains die. Not from bad notes — from the plumbing. The sync breaks, or the upkeep outpaces the payoff, or the whole thing is trapped in one app’s database and moving it feels like surgery. The knowledge was never the problem. The container was.

So when I rebuilt it, I started from the failure modes, not the features.

What I actually wanted

Three things, none of them “more notes”:

Memory I share with my AIs. Every time I open a fresh Claude session, it starts from zero — I re-explain my homelab, my projects, what we decided last week. I wanted a place both of us read and write, so the context survives the session.
Something that outlives any tool. No lock-in. If the app of the month dies, my brain shouldn’t die with it.
Sync that can’t rot. The thing that killed v1.

The one decision that matters

The store and the intelligence are different layers, and only the store is sacred.

The store is a folder of plain markdown in git. That’s it. Human-readable, diffable, greppable, yours. Everything clever sits above it and is fully rebuildable:

L5  Visualisation   3D graph, Obsidian, whatever reads markdown
L4  Automation      scheduled "gardener" runs
L3  Agent interface MCP servers — search, graph, note CRUD
L2  Index           SQLite: full-text + vectors + materialised edges
L1  Structure       typed frontmatter + [[wikilinks]]
L0  Substrate       markdown files in git   ← the only thing that's truth

Delete L1–L5 and nothing is lost — you rebuild them from L0 with one command. That property is the whole design. The index can corrupt, the embedding model can change, the viewer can break (mine did, spectacularly — that’s another post), and the knowledge doesn’t care. It’s text in git.
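
To show what “rebuildable” means in practice, here is a minimal sketch that materialises the link graph into SQLite from nothing but the markdown. It covers only the edges part of L2; the table layout is an assumption for the example, and the real index also carries full-text and vectors.

```python
import re
import sqlite3

# Capture the target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def rebuild_index(notes):
    """notes maps note name -> markdown body. Returns a fresh in-memory
    SQLite database holding the notes and their wikilink edges."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE notes (name TEXT PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE edges (src TEXT, dst TEXT, PRIMARY KEY (src, dst))")
    db.executemany("INSERT INTO notes VALUES (?, ?)", notes.items())
    for name, body in notes.items():
        for target in {t.strip() for t in WIKILINK.findall(body)}:
            if target in notes:  # ignore links to notes that do not exist
                db.execute("INSERT INTO edges VALUES (?, ?)", (name, target))
    db.commit()
    return db

db = rebuild_index({
    "map": "[[homelab]] and again [[homelab|the lab]]",
    "homelab": "",
    "scratch": "[[ghost]]",
})
print(db.execute("SELECT src, dst FROM edges").fetchall())  # [('map', 'homelab')]
```

Throwing the database away costs nothing: the same call over the same files produces it again.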

And sync is just git pull. No LiveSync daemon to wedge itself, no proprietary replication. The exact thing that killed v1 is now the most boring, battle-tested part of the stack. Three devices, one git pull, done.

Search that explains itself

The retrieval layer is deliberately not “throw it all at embeddings.” It fuses three signals — keyword (BM25), vector similarity, and graph expansion (pull in the neighbours of strong hits) — and every result reports which signals fired.

exo search "hybrid retrieval"
→ hybrid-retrieval   matched_on: [bm25, graph]

That matched_on matters more than it looks. An embeddings-only system gives you a ranked list and no reason — you can’t tell a real match from a vibe. For a brain I’m supposed to trust over years, “why did this surface?” is a feature, not a nicety.
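
A minimal sketch of how matched_on can fall out of the fusion step. The ranking here (count how many signals fired) is a deliberately crude assumption, not exo’s actual scoring, and the inputs stand in for real BM25 and vector results.

```python
def fuse(bm25_hits, vector_hits, neighbours, top_n_expand=1):
    """bm25_hits and vector_hits are ranked lists of note names, best first.
    neighbours maps a note name to the notes it links to.
    Returns (name, matched_on) pairs, most signals first."""
    matched = {}
    for signal, hits in (("bm25", bm25_hits), ("vector", vector_hits)):
        for name in hits:
            matched.setdefault(name, []).append(signal)
    # Graph expansion: pull in the neighbours of the strongest direct hits.
    strong = bm25_hits[:top_n_expand] + vector_hits[:top_n_expand]
    for name in strong:
        for neighbour in neighbours.get(name, ()):
            signals = matched.setdefault(neighbour, [])
            if "graph" not in signals:
                signals.append("graph")
    # The sort is stable, so ties keep first-seen order.
    return sorted(matched.items(), key=lambda item: -len(item[1]))

results = fuse(
    bm25_hits=["retrieval-notes", "hybrid-retrieval"],
    vector_hits=["embeddings"],
    neighbours={"retrieval-notes": ["hybrid-retrieval"]},
)
for name, matched_on in results:
    print(name, "matched_on:", matched_on)
```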

The AI is a librarian, not a hoarder

Here’s the part I care about most. The AI doesn’t just read the brain — it writes to it. Through an MCP server it can search, walk the graph, and author notes. But under a hard rule: every write is a reviewable git diff.

It searches before it writes (extend a note, don’t spawn a duplicate). It links instead of piling. A scheduled “gardener” pass finds orphaned notes and stale summaries and proposes fixes — as commits I can read and git revert if it gets something wrong. No black-box mutation of my memory. Just a librarian that files things while I’m asleep and leaves a paper trail.

So now “what am I building?” is a question with an instant, honest answer: a single map note, kept current, that every project links into. I ask, the AI pulls it, and neither of us has to remember.

Why not just…

Obsidian alone? It’s a lovely viewer — and I still use it as one. But it can’t give an agent structured read/write or explainable retrieval, and its sync is what burned me. Here Obsidian reads the same markdown; it’s a window, not the house.
Embeddings RAG? Opaque and one-directional. It can rank, but it can’t tell you why, and it can’t write back. This is transparent and bidirectional.
Notion / a SaaS brain? Lock-in by design. git clone is my backup and any text editor is my fallback.
A graph database? Unnecessary infra. The graph lives in the wikilinks; SQLite just materialises it. I’ll add Neo4j the day my queries actually outgrow a single file, and not a day sooner.

What it changes

The vault is small still — that’s fine; it grows by use. But the loop already pays off: I work, the AI checkpoints decisions into markdown, and the next session — fresh model, no memory of its own — searches the brain and is caught up in seconds. The knowledge stopped living only in my head and in dead chat logs.

I’m a team of one. There’s no colleague who remembers why I made a call six months ago, no handover doc someone else maintains. Continuity isn’t a nice-to-have; it’s the whole job. A second brain that the AI helps keep alive — and that I can git clone onto any machine in thirty seconds — is the first version of this idea that I actually trust to still be here in five years.

The notes from v1? They’re sitting in a folder, waiting to be triaged into v2. This time I’ll still be opening it.

🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.

Fri, 07 Nov 2025 00:00:00 +0000

The question

I run n8n on my k3s homelab. Not docker-compose on a NUC — the full treatment: GitOps-reconciled, Vault-backed secrets, default-deny networking. The same boring platform everything else here runs on.

But “I have n8n running” proves nothing. I wanted to know if I actually understood it as an agent platform, and to answer a question I kept dodging: for agent work, do I need a cloud model, or is my local one good enough?

So I built a real agent and gave it two brains.

What I built

A chat assistant over brew-buddy, my homemade kombucha-tracking app (React + a small API + Postgres). You ask it things in plain language; it calls the app’s API and answers. The twist: the same question runs through two agents in parallel — one backed by NVIDIA’s hosted Llama-3.3-70B, one by a local Phi-3.5-mini on CPU — and the workflow prints both answers side by side.

Chat ──▶ Agent (cloud: NVIDIA 70B) ──┐   tools (shared):
     └─▶ Agent (local: Phi-3.5)   ──┤     • get_all_batches
                                    │     • get_batch_detail
                                    │     • brewing_statistics
            (Merge) ──▶ both replies, labeled     • add_batch_log   ⟵ write
                                                  • create_batch    ⟵ write

Both agents share the same read tools. The two write tools are wired to the cloud agent only — more on that below.

The nice part: I didn’t write a line of glue. n8n’s stock OpenAI Chat Model node talks to anything OpenAI-compatible if you override the credential’s Base URL — so one node points at https://integrate.api.nvidia.com/v1, the other at http://llama-server..svc:8080/v1 for the local server. Same node, two endpoints.

The infra that keeps it honest

I won’t re-explain the platform here — it’s in earlier posts: GitOps, Vault-backed secrets, default-deny networking, dual-path TLS ingress. But building the agent made one of them tangible.

n8n is, by design, a thing that makes arbitrary HTTP calls on a schedule. That’s exactly what you want behind a default-deny network policy. n8n couldn’t reach the brew-buddy API at all until I declared it — one line:

# n8n's namespace
allowEgressToNamespaces: [web-ai-engine, web-brew-buddy]
#                                          ^ added this for the agent

(plus a matching ingress-allow on brew-buddy’s side). That’s the posture working as intended: the blast radius of a workflow tool is whatever I’ve explicitly granted, and not one namespace more. Adding a capability is a reviewable one-liner in Git; Argo reconciles it. No kubectl, no guessing what n8n can reach.

The A/B: same agent, same tools, two brains

Plain “hi”. Cloud answers in ~0.5s. Local takes noticeably longer — because even for “hi”, the agent feeds the model the full system prompt plus the JSON schemas for every tool, and Phi-3.5 has to chew through all of it on CPU before it can say a word. So far, the boring expected result: local is slower.

Then I asked a real question, and the result flipped in a way I didn’t expect.

“What batches do I have?”

Cloud (70B) called get_all_batches, got the real rows, and answered:

You have two batches: 2026-04-09-A (cold-crash, 3L) and 2026-04-09-W (cold-crash, 3L).

Local (Phi-3.5) never called the tool. It didn’t seem to realise it had tools. Instead it confidently explained how I could go find the data myself:

To list all batches: 1. Access the brew-buddy app. 2. Look for a button labeled “List Batches”… def get_all_batches(): … … Remember, I’m unable to directly interact with apps or databases.

Fake instructions. Fake code. A polite apology. Everything except the actual answer it was sitting on top of.

Writing data. I asked both to log an observation. Cloud called add_batch_log and wrote a real row to Postgres (“I have recorded the observation…”). Local bluffed again — “here’s how you can log it yourself.”

Why it matters: capability, not latency

The interesting finding isn’t “the big model is better.” It’s how the small one fails.

With a ~3.8B model on CPU, the bottleneck for agent work isn’t speed — it’s capability. Phi-3.5 couldn’t reliably emit tool calls, so n8n’s tools never fired, and the model degraded into a chatbot that hallucinates a plausible answer instead of fetching the real one. That failure mode is worse than an error: an error you catch, a confident wrong answer you ship.
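
This failure is detectable if you look at the raw response instead of the prose. A sketch, assuming the OpenAI-style chat completion shape that both endpoints speak: on the first model turn of a question that needs data, an answer with text and no tool_calls is a bluff. The tools_expected flag is a hypothetical input the caller supplies.

```python
def called_tools(response):
    """Names of the tools requested in an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    return [call["function"]["name"] for call in (message.get("tool_calls") or [])]

def looks_like_bluff(response, tools_expected=True):
    """True when the turn needed data but the model answered in prose only."""
    message = response["choices"][0]["message"]
    return tools_expected and not called_tools(response) and bool(message.get("content"))

cloud_turn = {"choices": [{"message": {
    "content": None,
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "get_all_batches", "arguments": "{}"}}],
}}]}
local_turn = {"choices": [{"message": {
    "content": "To list all batches: 1. Access the brew-buddy app...",
}}]}

print(called_tools(cloud_turn), looks_like_bluff(cloud_turn))
print(called_tools(local_turn), looks_like_bluff(local_turn))
```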

A couple of measurements that sharpened it:

NVIDIA 70B, plain chat: ~0.5s.
NVIDIA 70B, function-calling (with tool schemas): ~8.6s per round-trip — and an agent makes several round-trips per answer. That’s real latency you have to budget a timeout for. (It’s also why the cloud side initially timed out in n8n until I raised the model node’s timeout — the model was fine, n8n was cutting it off.)

So the snappy-vs-slow comparison flips depending on whether the question triggers tools. Plain chat: cloud wins on speed. Tool use: the local model is “fast” only because it skips the tools and makes something up. Speed was never the real axis.

The honest caveat: this is this small general model in a multi-tool agent loop. Purpose-built small models with tool-calling fine-tunes do better at narrow tasks — I run a 1.7B one elsewhere that emits a single structured tool call just fine. But for “pick the right tool from several and chain them,” 70B was in a different league.

The trust boundary

I gave the write tools (add_batch_log, create_batch) to the cloud agent only. The local agent is read-only — not by instruction, by wiring. Even if Phi-3.5 did decide to call a write tool, the connection isn’t there. The reliable model is the only one allowed to mutate real data, and that’s enforced structurally, not by trusting a prompt.
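
In n8n this is literally which nodes are connected, but the idea translates to a few lines of code. A sketch with the tool names from above; the dispatch function and handler signatures are hypothetical.

```python
READ_TOOLS = {"get_all_batches", "get_batch_detail", "brewing_statistics"}
WRITE_TOOLS = {"add_batch_log", "create_batch"}

# The boundary lives in the wiring, not in either agent's prompt.
AGENT_TOOLS = {
    "cloud": READ_TOOLS | WRITE_TOOLS,
    "local": READ_TOOLS,  # no write tools are connected at all
}

def dispatch(agent, tool_name, handlers, **kwargs):
    """Run a tool for an agent only if that tool is wired to it."""
    if tool_name not in AGENT_TOOLS[agent]:
        raise PermissionError(f"{agent} agent has no connection to {tool_name}")
    return handlers[tool_name](**kwargs)

handlers = {
    "get_all_batches": lambda: ["2026-04-09-A", "2026-04-09-W"],
    "add_batch_log": lambda note: f"logged: {note}",
}

print(dispatch("cloud", "add_batch_log", handlers, note="fizzy"))
try:
    dispatch("local", "add_batch_log", handlers, note="fizzy")
except PermissionError as err:
    print(err)
```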

What’s toy and what’s real

Worth being straight: this is a single-node homelab. The agent and both model paths share one box. Running n8n on Kubernetes and swapping models isn’t novel — n8n’s own docs cover queue mode, where a main instance fans work out to a pool of worker pods you scale horizontally, with external Postgres for state. That’s the real production shape. Mine is one replica with an emptyDir’s worth of ambition.

What I think is worth sharing is the finding (the capability cliff, and that its failure mode is confident fabrication) and the boring thing underneath it: because the platform is default-deny and GitOps-reconciled, running this experiment cost me one reviewable egress line and zero risk to anything else.

The boring part is the point

The AI was the fun bit. But the reason I could bolt an agent onto a live cluster, point it at a real app, give it write access to one model and not the other, and tear it all down again — without worrying what it might touch — is that the infrastructure was already boring. Default-deny. Secrets out of Git. git push, Argo reconciles.

The model picks the tools. The platform decides what the tools can reach. Keep those two honest about each other and self-hosting an agent stops being scary and starts being just another app.

🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

Fri, 26 Sep 2025 00:00:00 +0000

The problem with cloud LLM access

Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.

The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone’s personal data into the chat? It leaves the cluster. It’s logged somewhere you don’t control.

The standard answer is “train your users.” I’d rather have a technical control.

The architecture

Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain

Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:

SAFE → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)
SENSITIVE: → return HTTP 400, log the block, nothing leaves the cluster

The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.

The implementation

The proxy is ~150 lines of FastAPI. The classifier call:

CLASSIFIER_PROMPT = """You are a data security classifier. Check if the text below contains sensitive information:
passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.

Reply with ONLY one of:
SAFE
SENSITIVE: <reason>

Text to check:
"""

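import httpx  # async HTTP client used for the classifier call

async def classify(text: str) -> tuple[bool, str]:
    """Return (is_sensitive, reason) for the given text."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            f"{LLAMA_BASE}/chat/completions",
            json={
                "model": "phi-3.5-mini",
                # Only the first 3000 characters are sent to the classifier.
                "messages": [{"role": "user", "content": CLASSIFIER_PROMPT + text[:3000]}],
                "max_tokens": 30,
                "temperature": 0,
                "stream": False,
            },
            headers={"Authorization": "Bearer sk-no-key"},
        )
    # Raise on a non-2xx reply so the caller's error handling sees it.
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    if answer.upper().startswith("SENSITIVE"):
        # Everything after the first colon is the model's stated reason;
        # fall back to a generic message if it is missing or empty.
        reason = answer.partition(":")[2].strip() or "sensitive content detected"
        return True, reason
    return False, ""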

temperature=0 and max_tokens=30 keep the response deterministic and fast. The model only needs to output one word or one line.

The main handler:

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    body = await request.json()
    user_text = extract_user_text(body.get("messages", []))

    if user_text.strip():
        try:
            is_sensitive, reason = await classify(user_text)
        except Exception as exc:
            log.error("classifier error: %s — allowing request through", exc)
            is_sensitive = False

        if is_sensitive:
            return JSONResponse(status_code=400, content={
                "error": {
                    "message": f"Request blocked by ai-guard: {reason}. Remove sensitive content before sending to external models.",
                    "type": "content_policy_violation",
                }
            })

    # Safe — forward to upstream with streaming support
    ...
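
extract_user_text isn’t shown above. One plausible version, as a sketch: it assumes OpenAI-style messages where content is either a string or a list of typed parts, and it assumes every user turn should be checked, not just the last one. The real helper in ai-guard may differ.

```python
def extract_user_text(messages: list[dict]) -> str:
    """Join the text of all user messages into one string for classification."""
    chunks: list[str] = []
    for message in messages:
        if message.get("role") != "user":
            continue  # system and assistant turns are not classified here
        content = message.get("content")
        if isinstance(content, str):
            chunks.append(content)
        elif isinstance(content, list):
            # Multi-part content: keep only the text parts, skip images etc.
            for part in content:
                if isinstance(part, dict) and part.get("type") == "text":
                    chunks.append(str(part.get("text", "")))
    return "\n".join(chunks)
```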

Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I’d rather not block all cloud LLM access because the local model is warming up.
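
If you do want fail-closed, the switch is small. A sketch, assuming a hypothetical fail_mode setting that the proxy as described does not have; decide() is likewise an invented helper that turns a classifier outcome into an allow/block decision.

```python
from typing import Optional


def decide(is_sensitive: Optional[bool], fail_mode: str = "open") -> tuple[bool, str]:
    """Return (allow, reason).

    is_sensitive is None when the classifier itself failed (timeout, server down).
    fail_mode is "open" (let the request through) or "closed" (block it).
    """
    if is_sensitive is None:
        if fail_mode == "closed":
            return False, "classifier unavailable"
        return True, "classifier unavailable, allowed by fail-open policy"
    if is_sensitive:
        return False, "sensitive content detected"
    return True, ""
```

In the handler, the except branch would pass None instead of forcing is_sensitive to False, and the policy becomes one setting rather than a code path.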

Kubernetes deployment

ai-guard runs in the same namespace as llama-server and Open WebUI (web-ai-engine). This cluster’s Cilium policies isolate namespaces from each other but allow traffic inside a namespace, so no new network policy is needed.

Open WebUI uses semicolon-separated lists for multiple API backends:

- name: OPENAI_API_BASE_URLS
  value: "http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1"
- name: OPENAI_API_KEYS
  value: "sk-no-key;sk-no-key"

The second entry is ai-guard. Open WebUI passes sk-no-key as the API key — ai-guard ignores it and uses its own UPSTREAM_API_KEY from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.
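
The key swap is the part worth getting right. A sketch of how the proxy might build the outgoing headers, assuming UPSTREAM_API_KEY is read from the environment; upstream_headers is a hypothetical helper, not a function taken from ai-guard.

```python
import os


def upstream_headers(client_headers: dict[str, str], upstream_key: str) -> dict[str, str]:
    """Build headers for the upstream call, never reusing the client's key."""
    # Drop hop-specific headers and whatever Authorization the client sent.
    dropped = {"authorization", "host", "content-length"}
    headers = {k: v for k, v in client_headers.items() if k.lower() not in dropped}
    headers["Authorization"] = f"Bearer {upstream_key}"
    return headers


if __name__ == "__main__":
    key = os.environ.get("UPSTREAM_API_KEY", "example-key")
    print(upstream_headers({"Authorization": "Bearer sk-no-key", "Accept": "application/json"}, key))
```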

The latency tradeoff

The classification step adds 5–15 seconds per request with CPU inference. That’s the cost of keeping the check fully private: the classifier never sends data anywhere.

For a personal homelab assistant, this is fine. For a high-throughput production setup, you’d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.

What it catches

The classifier prompt targets:

Passwords, API keys, tokens, credentials
PII: names, emails, phone numbers, SSNs, addresses
Financial data: card numbers, bank accounts
Private keys

False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.

Source

github.com/janos-gyorgy/ai-guard — MIT licensed, Kubernetes manifests included.

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Fri, 12 Sep 2025 00:00:00 +0000

The problem with blocking

The PII guardrail proxy I built last week works by classifying prompts and blocking the sensitive ones. That’s fine for a chat interface where a human can rephrase. It doesn’t work for automated pipelines.

If a Jira ticket contains someone’s name and an internal hostname, you don’t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.

The pattern

Input text
  → anonymizer: extract PII, replace with semantic fakes
  → "Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net"
  + mapping: {"Nathan Chen" → "John Smith", "DataSoft LLC" → "ACME", ...}
  → cloud LLM: processes coherent text, never sees real values
  → "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
  → string substitution with reverse mapping
  → "John Smith should check the OAuth docs with the ACME team"

Two things that make this work:

Deanonymization needs no LLM. Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.

Semantic fakes beat placeholder tokens. An earlier version of this used [PERSON_1], [ORG_1] tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees Nathan Chen from DataSoft LLC, it treats it as real text and responds naturally. Quality is noticeably better.
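
The round trip from the diagram, as a sketch in plain Python. The replacement pairs are hard-coded where the anonymizer model would normally supply them, the cloud reply is a canned string, and the input sentence plus the OAuth/ProjectX pair are inferred from the diagram’s elided mapping. anonymize and deanonymize are illustrative names.

```python
def anonymize(text: str, pairs: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Swap each real value for its fake and return the reverse mapping."""
    mapping: dict[str, str] = {}
    for original, fake in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(original, fake)
        mapping[fake] = original  # fake -> real, used on the way back
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore real values with plain string substitution, longest fake first."""
    for fake, original in sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True):
        text = text.replace(fake, original)
    return text


# In the real service these pairs come from the anonymizer model.
PAIRS = [
    ("John Smith", "Nathan Chen"),
    ("ACME", "DataSoft LLC"),
    ("OAuth", "ProjectX"),
    ("auth.acme.internal", "dev.internal.net"),
]

source = "John Smith from ACME needs OAuth fixed on auth.acme.internal"
outbound, mapping = anonymize(source, PAIRS)

# Canned stand-in for the cloud model's answer; it only ever saw the fakes.
cloud_reply = "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
restored = deanonymize(cloud_reply, mapping)
```

Nothing on the return path needs a model: restored is produced by str.replace alone.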

Prior art — what already exists

This is a well-established pattern. Worth knowing what’s out there:

LLM Guard (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you’re building this for anything serious.

Microsoft PII Shield — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.

anonLLM — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.

REDACT — IEEE paper describing a system using Ollama for PII redaction in documents.

HuggingFace Anonymizer SLM series — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1’s 9.77.

That last one is what this implementation actually uses.

The model: Anonymizer-1.7B

eternisai/Anonymizer-1.7B is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:

{
  "name": "replace_entities",
  "arguments": {
    "replacements": [
      {"original": "John Smith", "replacement": "Nathan Chen"},
      {"original": "ACME Corp", "replacement": "DataSoft LLC"},
      {"original": "auth.acme.internal", "replacement": "dev.internal.net"}
    ]
  }
}

No prompt engineering needed. The model knows exactly what it’s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.

The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: hf.co/gabriellarson/Anonymizer-1.7B-GGUF.

The implementation

llm-anonymizer is a FastAPI service with two endpoints.

POST /anonymize — calls Ollama with the tool definition, parses the response:

TOOLS = [{
    "type": "function",
    "function": {
        "name": "replace_entities",
        "description": "Replace PII entities with anonymized versions",
        "parameters": {
            "type": "object",
            "properties": {
                "replacements": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "original": {"type": "string"},
                            "replacement": {"type": "string"},
                        },
                        "required": ["original", "replacement"],
                    },
                }
            },
            "required": ["replacements"],
        },
    },
}]

resp = await client.post(f"{OLLAMA_BASE}/api/chat", json={
    "model": MODEL,
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text + "\n/no_think"},  # skip Qwen3 thinking mode
    ],
    "tools": TOOLS,
    "stream": False,
})

tool_calls = resp.json()["message"]["tool_calls"]
replacements = tool_calls[0]["function"]["arguments"]["replacements"]

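# Apply the replacements and build the reverse mapping
# (replacement → original) that /deanonymize needs later.
# Longest originals go first so a short entity can't clobber part of a longer one.
anonymized = text
mapping = {}
for pair in sorted(replacements, key=lambda p: len(p["original"]), reverse=True):
    anonymized = anonymized.replace(pair["original"], pair["replacement"])
    mapping[pair["replacement"]] = pair["original"]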

The /no_think suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.
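
The parsing above assumes the model always returns exactly one well-formed tool call. A more defensive sketch follows; extract_replacements is a hypothetical helper, and the handling of string-encoded arguments is an assumption for backends that serialize them as JSON text rather than an object.

```python
import json


def extract_replacements(message: dict) -> list[dict[str, str]]:
    """Pull valid {original, replacement} pairs out of a chat message dict.

    Returns an empty list when the model made no replace_entities call,
    which the caller can treat as "nothing to anonymize" or as an error.
    """
    pairs: list[dict[str, str]] = []
    for call in message.get("tool_calls") or []:
        function = call.get("function", {})
        if function.get("name") != "replace_entities":
            continue
        arguments = function.get("arguments", {})
        if isinstance(arguments, str):
            # Assumption: some OpenAI-style backends send arguments as a JSON string.
            try:
                arguments = json.loads(arguments)
            except json.JSONDecodeError:
                continue
        if not isinstance(arguments, dict):
            continue
        for item in arguments.get("replacements", []):
            if not isinstance(item, dict):
                continue
            original = item.get("original")
            replacement = item.get("replacement")
            # Skip empty or no-op pairs; replacing "" would corrupt the text.
            if original and replacement and original != replacement:
                pairs.append({"original": original, "replacement": replacement})
    return pairs
```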

POST /deanonymize — no model call, just substitution:

for replacement, original in sorted(mapping.items(), key=lambda x: len(x[0]), reverse=True):
    text = text.replace(replacement, original)

Sorted by length descending so longer tokens don’t get partially overwritten by shorter ones.
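
Why the ordering matters, shown on a made-up mapping where one fake value is a prefix of another. Both the mapping and the restore helper are illustrative.

```python
def restore(text: str, mapping: dict[str, str], longest_first: bool) -> str:
    """Apply fake -> real substitutions, optionally longest fake first."""
    items = list(mapping.items())
    # Sort by fake length; ascending order is the worst case for overlaps.
    items.sort(key=lambda kv: len(kv[0]), reverse=longest_first)
    for fake, original in items:
        text = text.replace(fake, original)
    return text


# "Nathan" is a prefix of "Nathan Chen", so the two fakes overlap.
mapping = {"Nathan": "John", "Nathan Chen": "Maria Lopez"}
reply = "Nathan Chen asked Nathan to review the change"

good = restore(reply, mapping, longest_first=True)
bad = restore(reply, mapping, longest_first=False)
```

With the short fake applied first, "Nathan Chen" is already half rewritten by the time its own entry runs, so that entry never matches and the wrong person ends up in the output.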

The Kubernetes stack

Ollama runs as a separate deployment in the same namespace as everything else (web-ai-engine). Intra-namespace traffic is always allowed — no new network policies.

llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF

One-time model pull after first deploy:

kubectl exec -n web-ai-engine deploy/ollama -- \
  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF

Ollama caches it on a 10Gi PVC, so pod restarts don’t re-download.

The n8n pipeline

Five-node chain triggered by webhook:

Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond

The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.

Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.

The caveats

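1.7B isn’t GPT-4.1. The model scores 9.20/10 on the benchmark against GPT-4.1’s 9.77. That is an average quality score, not an error rate, so it doesn’t say how often an entity gets missed or replaced incorrectly, only that it happens more than with the larger model. Test with real examples from your domain before depending on it.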

Deanonymization breaks on heavy rephrasing. If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn’t eliminate the risk.

Ollama adds a deployment. It’s ~500MB image + the model weights (~1GB Q4). On a constrained single-node cluster that’s real overhead. llama-server already covers general chat; Ollama is purely for this model’s tool-calling support.
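
The rephrasing caveat can at least be made loud instead of silent. A heuristic sketch: after deanonymization, look for individual words of any fake value that are still present, such as a surname the cloud model used on its own. find_leftovers is a hypothetical helper, and the three-character minimum is an arbitrary threshold to skip noise.

```python
import re


def find_leftovers(restored: str, mapping: dict[str, str], min_len: int = 3) -> list[str]:
    """Return words from fake values that still appear in the restored text."""
    # Words that also occur in real values are expected, so never flag them.
    real_words = {w for original in mapping.values() for w in re.findall(r"\w+", original)}
    leftovers: list[str] = []
    for fake in mapping:
        for word in re.findall(r"\w+", fake):
            if len(word) < min_len or word in real_words or word in leftovers:
                continue
            # Whole-word match so "Chen" does not fire on "kitchen".
            if re.search(rf"\b{re.escape(word)}\b", restored):
                leftovers.append(word)
    return leftovers
```

A non-empty result doesn’t prove a leak in either direction; it is a cheap signal that the response deserves a second look before it goes back to the caller.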

Source

github.com/janos-gyorgy/llm-anonymizer — MIT licensed, Kubernetes manifests and n8n workflow included.

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

Fri, 29 Aug 2025 00:00:00 +0000

What “operating an LLM” actually means

Running a local model is easy. Understanding what it’s doing is less so.

After deploying llama.cpp + Open WebUI on k3s (previous post), I had a chat interface backed by a local model. What I didn’t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.

The instinct for this kind of problem is usually “add a proxy layer.” There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.

The thing I’d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.

`--metrics`

One additional argument to the inference server:

args:
  - -m
  - /models/Phi-3.5-mini-instruct-Q4_K_M.gguf
  - --host
  - "0.0.0.0"
  - --port
  - "8080"
  - --ctx-size
  - "4096"
  - --n-predict
  - "1024"
  - --parallel
  - "1"
  - --metrics        # ← this
  - --log-disable

After restart, GET /metrics on port 8080 returns valid Prometheus exposition format:

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0

The full set of metrics:

Metric	Type	What it measures
`llamacpp:prompt_tokens_total`	counter	Input tokens processed (cumulative)
`llamacpp:tokens_predicted_total`	counter	Output tokens generated (cumulative)
`llamacpp:prompt_tokens_seconds`	gauge	Current prompt throughput (tok/s)
`llamacpp:predicted_tokens_seconds`	gauge	Current generation throughput (tok/s)
`llamacpp:tokens_predicted_seconds_total`	counter	Total time spent generating
`llamacpp:prompt_seconds_total`	counter	Total time spent on prompts
`llamacpp:requests_processing`	gauge	Requests currently being processed
`llamacpp:requests_deferred`	gauge	Requests queued, waiting for a slot
`llamacpp:n_decode_total`	counter	Total llama_decode() calls
`llamacpp:n_busy_slots_per_decode`	counter	Slots active per decode call

These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.
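
To sanity-check the endpoint without Prometheus, the exposition text is easy to parse by hand. A sketch that handles only the simple unlabelled "name value" lines shown above (no labels, timestamps, or histograms); the sample numbers are invented.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled 'name value' lines from Prometheus exposition text."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore anything this simple parser doesn't understand
    return metrics


def lifetime_ms_per_token(metrics: dict[str, float]) -> float:
    """Average generation latency since server start, in ms per token."""
    tokens = metrics.get("llamacpp:tokens_predicted_total", 0.0)
    seconds = metrics.get("llamacpp:tokens_predicted_seconds_total", 0.0)
    return seconds / tokens * 1000 if tokens else 0.0


SAMPLE = """\
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 400
llamacpp:tokens_predicted_seconds_total 100
llamacpp:requests_processing 0
"""

parsed = parse_metrics(SAMPLE)
```

Feeding it the body of GET /metrics gives the "total time / total tokens" latency as a lifetime average, which is coarser than a windowed rate but needs no time series.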

Prometheus scrape config

Adding a static scrape target in the existing Prometheus configuration:

extraScrapeConfigs: |
  - job_name: llama-server
    static_configs:
      - targets:
          - llama-server.web-ai-engine.svc:8080
    metrics_path: /metrics

The only non-obvious thing here is the network policy: Prometheus lives in dashboard-homelab, and llama-server lives in web-ai-engine. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In applications.yml:

- namespace: web-ai-engine
  networkPolicies:
    allowIngressFromNamespaces: [dashboard-homelab]

Without this, Prometheus scrape attempts fail silently with a timeout.

Grafana dashboard via ConfigMap

Rather than importing a dashboard JSON manually through the Grafana UI, I let the Grafana sidecar handle it automatically. Any ConfigMap with the label grafana_dashboard: "1" is picked up, loaded, and made available in Grafana. Here the sidecar watches all namespaces; that scope is set by the chart’s sidecar searchNamespace value, so check it rather than assuming a default.

The dashboard ConfigMap lives in web-ai-engine, not dashboard-homelab. The sidecar finds it regardless:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-llm
  namespace: web-ai-engine
  labels:
    grafana_dashboard: "1"
data:
  llm-metrics.json: |
    {
      "title": "LLM Metrics",
      "uid": "llm-metrics",
      ...
    }

Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.

This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app’s Kubernetes resources also describes what the monitoring looks like.

What the dashboard shows

After sending a few messages through Open WebUI:

Generation throughput — the llamacpp:predicted_tokens_seconds gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you’re comparing models or quantisation levels.

Cumulative tokens: llamacpp:prompt_tokens_total and llamacpp:tokens_predicted_total both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it’s typically 3:1 prompt to generation; summarisation pushes it further toward prompt tokens, since a long document goes in and a short summary comes out.

Queue depth — llamacpp:requests_deferred is 0 almost always, which is expected with --parallel 1. If it’s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.

ms/token — derived from rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.
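
What that PromQL expression computes, spelled out for two scrapes of the counters. The numbers are invented; the function takes the ratio of the two increases over the window, which is what dividing one rate() by the other amounts to.

```python
def ms_per_token(prev: dict[str, float], curr: dict[str, float]) -> float:
    """Per-token generation latency between two scrapes, in milliseconds."""
    d_seconds = (curr["llamacpp:tokens_predicted_seconds_total"]
                 - prev["llamacpp:tokens_predicted_seconds_total"])
    d_tokens = (curr["llamacpp:tokens_predicted_total"]
                - prev["llamacpp:tokens_predicted_total"])
    if d_tokens <= 0:
        return 0.0  # nothing generated in the window (or the server restarted)
    return d_seconds / d_tokens * 1000


def tokens_per_second(ms: float) -> float:
    """Convert ms/token back to throughput."""
    return 1000 / ms if ms else 0.0


earlier = {"llamacpp:tokens_predicted_total": 1000, "llamacpp:tokens_predicted_seconds_total": 250}
later = {"llamacpp:tokens_predicted_total": 1200, "llamacpp:tokens_predicted_seconds_total": 300}

latency = ms_per_token(earlier, later)
```

In this example 200 tokens took 50 seconds, so latency is 250 ms/token, which is the same thing as 4 tok/s and sits inside the 3–5 tok/s range the throughput gauge reports.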

What’s missing compared to a proxy layer

LiteLLM and similar proxies give you things this setup doesn’t:

Per-model routing — if you’re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.
Virtual API keys — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.
Spend tracking — meaningful when you’re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.

For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.

The pattern

The broader point is that the observable unit here isn’t the proxy — it’s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it’s the right place to measure.

Starter manifests with the metrics configuration included: homelab-ai-inference-starter

🤖 Local LLM Inference on Kubernetes, No GPU Required

Fri, 15 Aug 2025 00:00:00 +0000

The GPU assumption

Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you’re wasting your time — inference will be too slow to be useful.

That’s not been my experience.

I’ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Generation speed is 3–6 tokens per second on CPU: slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do (rephrase something, summarise a document, explain a config option without sending it to an external API), the latency is fine.

The point isn’t that CPU inference beats GPU inference. It’s that “good enough for personal use” is a much lower bar than “production LLM serving”, and the hardware you already have probably clears it.

The stack

Two components:

llama.cpp (ghcr.io/ggml-org/llama.cpp:server) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.

Open WebUI (ghcr.io/open-webui/open-webui) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.

The architecture is simple on purpose:

Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models

Open WebUI doesn’t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn’t change. The interface is the standard.
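
Because the interface is the standard, any client that can POST JSON works. A sketch using only the standard library; the in-cluster URL is an assumed service address, the model name is a placeholder, and build_chat_request and ask are illustrative helpers.

```python
import json
import urllib.request

BASE_URL = "http://llama-server.web-ai-engine.svc:8080/v1"  # assumed in-cluster address


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = "phi-3.5-mini") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Pointing the same code at a different OpenAI-compatible backend should only mean changing base_url (plus an API key header for a hosted endpoint), which is the whole argument for keeping the frontend ignorant of what sits behind it.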

Model choice

GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:

Model	Quant	Size	RAM at runtime	Notes
Llama-3.2-3B	Q4_K_M	~2 GB	~3 GB	Fastest, lowest quality
Phi-3.5-mini	Q4_K_M	~2.4 GB	~3–4 GB	Good balance — what I use
Mistral-7B-Instruct	Q4_K_M	~4.1 GB	~5–6 GB	Noticeably better, needs more RAM
Llama-3.1-8B	Q4_K_M	~4.7 GB	~6–8 GB	High quality, stretches 16 GB with other workloads

On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it’s tighter.

Models live in /srv/ai-models on the node, mounted into the pod as a hostPath volume. Single-node homelab, so there’s no scheduling concern. Download once with wget, done.

Key configuration choices

Context size (--ctx-size 4096): How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you’re summarising long documents, bump to 8192 and watch your RAM usage.

Max output tokens (--n-predict 1024): Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.

Parallel slots (--parallel 1): How many concurrent inference requests the server handles. On CPU there’s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.

Memory limits: Set the container limit to roughly 2× the model’s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded, so 2× comes to about 4.8 GB; the 6Gi limit below leaves a little extra headroom on top of that.

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 6Gi

No CPU limit. llama-server will use however many cores are available during inference — that’s what makes it usable. A CPU limit would throttle inference to unusable speeds.

Deployment as a GitOps push

The whole stack lives in one YAML values file, deployed through the extra-objects chart that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.

Nothing was kubectl apply-ed. The deployment happened by pushing to Git.

What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no kubectl. The same process I use for any other service in the cluster.

The namespace, network policies, service account, and RBAC all generate from a single entry in applications.yml — same as every other app. The AI inference stack isn’t special from an operations perspective.

# applications.yml excerpt
- namespace: web-ai-engine
  applications:
    - applicationCode: web-ai-engine
      path: helm-charts/extra-objects
      autoSync: true

Access and auth

The service is exposed at ai.hippotion.com through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let’s Encrypt cert. See that post for the full explanation.

Auth is handled by Traefik’s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI’s own auth is disabled (WEBUI_AUTH: false) — the OAuth layer upstream handles it. One login covers every service in the cluster.

The WEBUI_SECRET_KEY (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.

What the day-to-day is actually like

Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you’re reading what came before while it generates, that’s fine. For quick factual lookups, it’s a little tedious.

The useful cases for a local model, for me:

Rephrasing or editing text — paste something, ask it to tighten it. No data leaves the house.
Config explanation — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.
Quick summaries — short documents, log snippets, error messages.
Experimentation — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.

For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I’m iterating and don’t want to pay per token.

The starting point if you want to try it

The manifests are on GitHub: homelab-ai-inference-starter

It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.

No GPU required. The ThinkCentre in the corner of my desk does the job.