<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ai on hippotion</title><link>https://blog.hippotion.com/tags/ai/</link><description>Recent content in Ai on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 12 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Two Birds That Read the Web for Me: One Hoards, One Scatters</title><link>https://blog.hippotion.com/posts/two-birds-read-the-web/</link><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/two-birds-read-the-web/</guid><description>I gave my second brain two agents that read the outside world and collide it against my notes. A Magpie watches my GitHub stars and only speaks when something hits live work. A Blue Jay reads a handful of RSS feeds and surfaces the distant, not-yet-relevant connection. They share a security spine — and they have deliberately opposite jobs. Here&amp;rsquo;s why the split is the whole design.</description><content:encoded><![CDATA[<p>I have a <a href="/posts/a-second-brain-you-can-git-clone/">vault of markdown notes</a> that
I treat as a second brain, and I <a href="/posts/gitops-for-my-brain/">run GitOps over it</a>
like it&rsquo;s production infrastructure. It already has agents that work on it from
the <em>inside</em>: a <a href="/posts/an-ai-gardener-for-your-second-brain/">nightly gardener</a>
that weeds orphans and suggests links, and a Wanderer that collides random pairs
of my own notes looking for connections I missed.</p>
<p>The obvious next move is to point an agent at the <em>outside</em> — let it read the web
and tell me what matters. That move is also a small landmine, and most &ldquo;AI reads
the internet for you&rdquo; tooling steps right on it. So this week I built two of them
instead of one, named them after corvids, and the reason there are two is the
entire point of this post.</p>
<p>Meet the <strong>Magpie</strong> and the <strong>Blue Jay</strong>.</p>
<h2 id="the-same-fear-twice">The same fear, twice</h2>
<p>Before either bird got a name, both inherited a single non-negotiable rule, and
it&rsquo;s worth saying plainly because it&rsquo;s the part everyone skips:</p>
<blockquote>
<p>An agent that reads the internet and writes to your notes is a prompt-injection
pipeline aimed straight at your trust root.</p>
</blockquote>
<p>My vault isn&rsquo;t just storage. Every <em>other</em> agent — the gardener, the Wanderer,
the search that answers &ldquo;what am I building?&rdquo; — reads it as <strong>trusted context</strong>.
So the moment one agent ingests a GitHub README or a news headline (attacker-
influenceable text) and is allowed to write a note, a stranger on the internet
gets to whisper instructions into the thing my whole system believes. &ldquo;Structured
API&rdquo; narrows that surface. It does not close it.</p>
<p>Both birds are built on the same chassis as the gardener, and that chassis
<em>enforces</em> the fear rather than trusting the model to behave:</p>
<ul>
<li><strong>Two phases, hard split.</strong> A wrapper-owned <strong>FETCH</strong> step pulls the external
text in plain Bash — Claude is not in the loop, can&rsquo;t be talked into anything,
because it isn&rsquo;t running yet. Then a <strong>COLLIDE</strong> step starts <code>claude -p</code> with
the fetched text handed in as inline <em>data</em>, and that process gets only
<code>Read</code> / <code>Glob</code> / <code>Grep</code> / <code>Write</code>. <strong>No Bash, no git, no network, no MCP.</strong>
While untrusted text is in the context window, the agent has no tool that can
reach the outside world or rewrite history.</li>
<li><strong>Allowlist, not the open web.</strong> Each bird reads a short, named list of
sources. Nothing else.</li>
<li><strong>Quarantine, not the vault.</strong> Findings land in <code>quarantine/&lt;bird&gt;/</code>, which
lives <em>outside</em> <code>vault/</code>. The indexer never sees it. Nothing it writes is ever
auto-wikilinked into the graph. Promotion to a real note is a thing <strong>I</strong> do,
by hand, after reading it.</li>
<li><strong>Blast radius is checked, not assumed.</strong> A run may modify <em>only</em> its quarantine
directory. Anything written anywhere else is discarded and reported as a
<code>violation</code>.</li>
<li><strong>&ldquo;Nothing found&rdquo; is a successful run.</strong> Neither bird has a quota. This is the
honesty contract I stole from the Wanderer — an agent under pressure to produce
N findings will manufacture N findings, and manufactured insight is worse than
silence.</li>
</ul>
<p>That&rsquo;s the shared spine. Now the interesting part: given the <em>same</em> security
model, the two birds do almost opposite things, and trying to make one bird do
both jobs would have quietly ruined it.</p>
<h2 id="the-magpie-hoards-whats-already-shiny">The Magpie hoards what&rsquo;s already shiny</h2>
<p>A magpie collects shiny objects and keeps them close. Mine watches <strong>my own
GitHub stars</strong>.</p>
<p>The premise is <em>slow public signal × private context</em>. I starred some repo three
weeks ago, forgot about it, and moved on. Meanwhile my projects shifted. The
Magpie runs weekly, pulls my starred repos through one allowlisted endpoint
(<code>gh api user/starred</code>), and collides each one against what I&rsquo;m <strong>actively
building right now</strong> — the live projects, the open hubs.</p>
<p>Its output contract is a tight one: it is a <strong>relevance filter</strong>. It fires <em>only</em>
when a star actually touches live work, and every finding has to name three
concrete things — the repo, the project it connects to, and one &ldquo;so what.&rdquo; A
vague &ldquo;these are thematically related&rdquo; doesn&rsquo;t count as a hit. It&rsquo;s a watchdog on
the dials, not a newsletter.</p>
<p>The supervised proof run, over 28 stars, surfaced exactly two real hits and
refused to invent a third:</p>
<ol>
<li><strong><code>supertonic</code> (on-device multilingual TTS)</strong> × my
<a href="/posts/clone-your-voice-hungarian-audiobooks/">Hungarian-audiobook voice-cloning project</a>
— a possible escape from a TTS fight I&rsquo;d been losing. I checked: it genuinely
supports Hungarian. That&rsquo;s a hit with a so-what.</li>
<li><strong><code>agentmemory</code></strong> × the exocortex itself — prior art for persistent AI memory,
notably <em>with benchmarks</em> my own notes lacked. (And if you&rsquo;ve read about
<a href="/posts/graph-hurt-my-search/">the time I benchmarked my own search and it lost</a>,
you&rsquo;ll know how much I needed that nudge.)</li>
</ol>
<p>The other ~22 stars mapped to tidy thematic clusters and were correctly <em>not</em>
reported. That restraint is the feature.</p>
<h2 id="the-blue-jay-scatters-acorns-and-forgets-where">The Blue Jay scatters acorns and forgets where</h2>
<p>Here&rsquo;s the bird that explains why there are two.</p>
<p>Blue jays don&rsquo;t hoard close like magpies. They <strong>cache acorns far and wide and
forget where they buried some</strong> — and the forgotten ones grow into oak trees.
Ecologists think blue jays are why oak forests spread north after the last ice
age. Seed dispersal, by way of a bad memory. That is <em>exactly</em> the job I wanted
for the second bird, and the metaphor was too good to pass up.</p>
<p>The Blue Jay reads an allowlist of <strong>eight RSS feeds</strong>, picked so tech and science
cross-pollinate:</p>
<ul>
<li><strong>Tech:</strong> Hacker News (high-score front page), lobste.rs, Ars Technica</li>
<li><strong>Science &amp; ideas:</strong> phys.org, Quanta, Aeon, Nautilus</li>
<li><strong>Wildcard:</strong> Medium — but scoped to specific <em>tag</em> feeds, never the raw
firehose of crypto and self-help</li>
</ul>
<p>Quanta, Aeon, and Nautilus are on that list on purpose: they&rsquo;re the connective
tissue, the feeds where &ldquo;huh, that&rsquo;s weirdly similar to&hellip;&rdquo; happens before my
vault even gets involved.</p>
<p>And its output contract is the <strong>opposite</strong> of the Magpie&rsquo;s. The Blue Jay is a
<strong>serendipity filter</strong>. Its job is to surface the connection that <em>isn&rsquo;t</em> in my
projects yet — the distant idea, the acorn worth burying. If I ran it through the
Magpie&rsquo;s &ldquo;only fire on a live-work hit&rdquo; rule, I would strangle the one thing it
exists to do. Relevance and serendipity pull in opposite directions, and you
can&rsquo;t tune a single agent to maximize both.</p>
<p>One more load-bearing detail, half design and half security: the Blue Jay
<strong>collides on the RSS summary only</strong> — title, abstract, link. It never pulls the
full article body into context. That&rsquo;s simultaneously the lower-injection path
<em>and</em> the right cognitive shape (a headline is a seed; I click through myself from
quarantine if the seed is interesting). The narrow input is doing double duty.</p>
<h2 id="why-two-birds-and-not-one-with-a-flag">Why two birds and not one with a flag</h2>
<p>I genuinely considered making this one agent with a <code>--mode=relevance|serendipity</code>
switch. I&rsquo;m glad I didn&rsquo;t, and the reasoning generalizes past birds:</p>
<table>
	<thead>
			<tr>
					<th></th>
					<th><strong>Magpie</strong></th>
					<th><strong>Blue Jay</strong></th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Source</td>
					<td>my GitHub stars (structured API)</td>
					<td>8 RSS feeds (open prose)</td>
			</tr>
			<tr>
					<td>Injection risk</td>
					<td>low</td>
					<td>the highest frontier</td>
			</tr>
			<tr>
					<td>Fires when</td>
					<td>a star hits <strong>live work</strong></td>
					<td>a summary sparks a <strong>distant</strong> idea</td>
			</tr>
			<tr>
					<td>Output</td>
					<td>relevance: repo → project → so-what</td>
					<td>serendipity: the not-yet-relevant connection</td>
			</tr>
			<tr>
					<td>Failure mode it guards against</td>
					<td>noise / false relevance</td>
					<td>being strangled into silence</td>
			</tr>
	</tbody>
</table>
<p>Two things made the split non-negotiable. First, <strong>the output contracts are too
different to share one brain</strong> — &ldquo;only speak on a hit&rdquo; and &ldquo;speak about the thing
that isn&rsquo;t a hit yet&rdquo; are contradictory prompts, and a single agent told to do
both does neither well. Second, <strong>open news is a higher injection frontier than a
structured stars API</strong>, so the riskier bird deserves its own enforced blast-radius
wrapper, not a code path bolted onto the safe one. When two jobs disagree on both
<em>what good output is</em> and <em>how dangerous the input is</em>, that&rsquo;s not a flag. That&rsquo;s
two programs.</p>
<p>So now my vault has two more agents reading the world on a cron. The Magpie runs
Saturday at 06:00 and tells me when something I bookmarked finally became
relevant. The Blue Jay runs Saturday at 07:00 and buries acorns in a quarantine
folder, most of which I&rsquo;ll ignore — but I only need one of them to grow into an
oak.</p>
<p>Both are on probation for their first few runs, because I don&rsquo;t trust a thing that
reads the internet until I&rsquo;ve watched it behave. But the part I&rsquo;m actually happy
about isn&rsquo;t the agents. It&rsquo;s that building the <em>second</em> one forced me to say out
loud what the first one was secretly assuming — and the names made the difference
impossible to forget. A magpie hoards. A blue jay scatters. You want both, and
you do not want them to be the same bird.</p>
]]></content:encoded></item><item><title>I Added a Knowledge Graph to My Search. It Made It Worse.</title><link>https://blog.hippotion.com/posts/graph-hurt-my-search/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/graph-hurt-my-search/</guid><description>My second brain searches over a vault of markdown using BM25 + vectors + graph expansion. I&amp;rsquo;d been telling people the graph improved recall. Then I finally benchmarked it, and plain keyword search beat my fancy hybrid — the graph was actively dragging the right answers out of the results. Here&amp;rsquo;s the scorecard and what it taught me about where graphs actually belong.</description><content:encoded><![CDATA[<p>I have a note in my second brain that I wrote months ago. It says, with the
confidence of someone who hadn&rsquo;t measured anything:</p>
<blockquote>
<p>Combining lexical search (BM25) with vector similarity and graph expansion
produces more robust recall than embeddings alone.</p>
</blockquote>
<p>That sentence shipped into production. My <a href="/posts/a-second-brain-you-can-git-clone/">vault of markdown notes</a>
gets indexed into a search database, and the search function fuses three
signals: BM25 (classic keyword ranking), vector similarity (embeddings), and
<strong>graph expansion</strong> — when a note matches, pull in its linked neighbours too,
on the theory that the thing you want is often <em>next to</em> the thing you typed.</p>
<p>It sounds right. Graphs are having a moment in RAG. &ldquo;Add a knowledge graph to
your retrieval&rdquo; is the kind of thing you can put on a slide and nobody pushes
back. I believed it enough to make graph expansion a first-class signal with a
weight of <code>0.5</code> — equal footing with keyword matching.</p>
<p>This week I finally wrote a benchmark. The graph wasn&rsquo;t helping. It was the
single biggest thing <em>hurting</em> my search.</p>
<h2 id="the-setup">The setup</h2>
<p>30 gold queries against the live vault (63 notes), borrowing the harness shape
from an eval framework I&rsquo;d been reading. Each query has a hand-labelled &ldquo;correct&rdquo;
note. I measured recall@5 (did the right note land in the top 5?) and MRR (how
high did it rank?), across three retrievers:</p>
<ul>
<li><strong>grep</strong> — naive substring term-count. The dumb floor.</li>
<li><strong>bm25</strong> — pure keyword ranking, FTS5&rsquo;s BM25. The honest baseline.</li>
<li><strong>live</strong> — my production hybrid (BM25 + vector + graph).</li>
</ul>
<p>I expected a clean staircase: grep at the bottom, bm25 in the middle, my
clever hybrid on top. That&rsquo;s the whole reason you build the clever thing.</p>
<h2 id="the-scorecard">The scorecard</h2>
<table>
	<thead>
			<tr>
					<th>retriever</th>
					<th>recall@5</th>
					<th>MRR</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>grep</td>
					<td>0.467</td>
					<td>0.307</td>
			</tr>
			<tr>
					<td>bm25</td>
					<td><strong>0.950</strong></td>
					<td><strong>0.826</strong></td>
			</tr>
			<tr>
					<td>live (hybrid, <code>w_graph=0.5</code>)</td>
					<td>0.650</td>
					<td>0.520</td>
			</tr>
	</tbody>
</table>
<p>Read that bottom row twice. My production &ldquo;smart&rdquo; search found the right note
<strong>65%</strong> of the time. Plain keyword search found it <strong>95%</strong> of the time. The
hybrid I&rsquo;d been quietly proud of was <em>worse than its own baseline</em> — it broke
<strong>9 of 30 queries that BM25 got right</strong>. BM25 alone whiffed on exactly one.</p>
<p>The clever layer wasn&rsquo;t adding intelligence. It was adding noise, confidently.</p>
<h2 id="why-the-graph-backfired">Why the graph backfired</h2>
<p>Here&rsquo;s the mechanism, and it&rsquo;s almost funny once you see it.</p>
<p>Graph expansion pulls in a matched note&rsquo;s neighbours. But in a real knowledge
base, the most <em>connected</em> notes are hubs — my inbox of ideas, my project radar,
my &ldquo;things Claude noticed&rdquo; log. Everything links to them, so they&rsquo;re everyone&rsquo;s
neighbour. When I searched for something specific, the graph helpfully dragged
these popularity-contest winners into the candidate set, and they elbowed the
genuinely relevant note clean out of the top 5.</p>
<p>Concrete example. Query: <em>&ldquo;who owns this knowledge system?&rdquo;</em> The correct answer
is my personal note. BM25 ranked it #5 — just barely in. The hybrid, drunk on
graph neighbours, pushed it off the list entirely. The graph didn&rsquo;t find a
better answer. It buried a good one under hubs.</p>
<p>I swept the graph weight to confirm it wasn&rsquo;t a fluke. It was perfectly
monotonic — <strong>every</strong> increment of graph made search worse:</p>
<table>
	<thead>
			<tr>
					<th>graph weight</th>
					<th>recall@5</th>
					<th>MRR</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>0.0 (off)</td>
					<td>0.950</td>
					<td>0.826</td>
			</tr>
			<tr>
					<td>0.1</td>
					<td>0.950</td>
					<td>0.737</td>
			</tr>
			<tr>
					<td>0.25</td>
					<td>0.817</td>
					<td>0.564</td>
			</tr>
			<tr>
					<td>0.5 (what I shipped)</td>
					<td>0.650</td>
					<td>0.520</td>
			</tr>
	</tbody>
</table>
<p>There&rsquo;s no ambiguity to argue with. More graph, more harm, no exceptions. The
value I&rsquo;d been <em>claiming</em> in that confident note — I finally measured it, and
it was negative.</p>
<h2 id="the-fix-and-the-actual-lesson">The fix, and the actual lesson</h2>
<p>The fix was one line: drop the default graph weight from <code>0.5</code> to <code>0.1</code>. Recall
snapped back to 0.95, tying pure BM25. (Turning the graph fully off is
marginally better still on MRR; I kept a whisper of it as a tiebreaker, which is
a taste call, not a data-driven one.)</p>
<p>But the one-line fix isn&rsquo;t the point. The point is <em>where graphs belong</em>.</p>
<p>Graph expansion isn&rsquo;t a bad idea — I aimed it at the wrong job. <strong>Precision
retrieval</strong> (&ldquo;find me the one note that answers this&rdquo;) wants to be narrow and
literal. Pulling in neighbours is the opposite of what you want; every neighbour
is a chance to be wrong. But I have a <em>different</em> feature in this same system —
a discovery mode that deliberately collides distant notes to surface unexpected
connections. There, neighbour-pulling isn&rsquo;t noise, it&rsquo;s the entire product.</p>
<p>Same mechanism. One context it&rsquo;s poison, the other it&rsquo;s the point. I&rsquo;d been
running my discovery tool inside my lookup tool and calling it a hybrid.</p>
<p>A few honest caveats, because a benchmark you can&rsquo;t poke holes in is usually
lying: my gold set is self-authored v1, the corpus is small (63 notes), and the
vector signal was actually <em>dark</em> during this run — I hadn&rsquo;t built the
embeddings yet, so &ldquo;hybrid&rdquo; here was really &ldquo;BM25 + graph.&rdquo; The vector half of
my original claim is still untested. This is directional, not gospel.</p>
<p>But directional was enough. I&rsquo;d shipped a claim, the claim got measured, and it
didn&rsquo;t survive contact with 30 queries. That&rsquo;s the whole reason I <a href="/posts/gitops-for-my-brain/">keep my
brain in git with everything reproducible</a>: so the
day I bother to measure, the measurement can actually win the argument against
my own confident prose.</p>
<p>The slide-deck version of RAG says <em>add a graph</em>. The benchmark says <em>know which
question you&rsquo;re answering first</em>. I&rsquo;ll take the benchmark.</p>
]]></content:encoded></item><item><title>VoteWatch: How Your Representatives Voted — and Whether You'd Agree</title><link>https://blog.hippotion.com/posts/votewatch/</link><pubDate>Fri, 15 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/votewatch/</guid><description>Parliamentary roll-call votes are public, machine-readable, and almost completely unread. I built a thing that scrapes them, distills each decision into one plain-language question, shows which party voted which way, and lets you register whether you agree — then puts your answer next to how parliament actually voted. The rule that keeps it honest: the AI writes the summary, but it never decides a fact.</description><content:encoded><![CDATA[<h2 id="open-data-nobody-opens">Open data nobody opens</h2>
<p>Every vote in the European Parliament and the Slovak National Council is
public. The EU even ships it as a clean API. And almost nobody reads it,
because the raw record is unreadable: <em>&ldquo;Návrh poslanca… ktorým sa dopĺňa zákon
č. 581/2004 Z. z. … (tlač 1259) — tretie čítanie, hlasovanie o návrhu zákona
ako o celku.&rdquo;</em> Multiply that by a few hundred votes a sitting. Transparency
that no human can parse is transparency on paper only.</p>
<p>So I built <strong>VoteWatch</strong> — a small site on my homelab that turns the record
into something a citizen can actually use: <em>what was decided, who voted, and
do you agree?</em></p>
<figure>
    <img loading="lazy" src="sk-overview.png"
         alt="VoteWatch SK in plain-language mode"/> <figcaption>
            <p>VoteWatch SK: each decision summarised in plain language, which parties voted how, and a Yes/No question whose live citizen tally sits next to how parliament actually voted — labelled <em>agree</em> or <em>gap</em>.</p>
        </figcaption>
</figure>

<h2 id="two-halves-one-lopsided">Two halves, one lopsided</h2>
<p>The EU half was easy. <a href="https://howtheyvote.eu">HowTheyVote.eu</a> already did the
hard work and publishes roll-call votes as a clean, open-licensed API. You
consume it; you don&rsquo;t scrape it.</p>
<p>The Slovak half is where the real work lives — and the real value. <code>nrsr.sk</code>
has <strong>no API</strong>. The HTML is the contract: a results listing, and per-vote
pages where each MP appears next to a one-letter code (<code>[Z]</code> za, <code>[P]</code> proti,
<code>[?]</code> zdržal sa). So the national half is a genuine scraper — the unglamorous
kind that nobody maintains, which is exactly why a gap exists to fill. The
unglamorous part <em>is</em> the moat.</p>
<h2 id="from-ten-votes-to-one-question">From ten votes to one question</h2>
<p>A single bill generates a pile of procedural roll-calls — shorten the debate,
move to third reading, amendment block A, amendment block B, the bill as a
whole. Ten rows that are really one decision. Nobody wants ten rows.</p>
<p>So the pipeline groups votes by bill, then asks an LLM (llama-3.3-70b on
NVIDIA NIM) to do exactly one job: turn the bureaucratic titles into a plain
headline, two sentences of summary, and <strong>one neutral Yes/No question</strong> a
person can actually answer. Seven votes on the health-insurer bill collapse
into: <em>&ldquo;Changes to the health-insurance law&rdquo;</em> → <em>&ldquo;Do you agree with the
health-insurance bill?&rdquo;</em></p>
<h2 id="the-rule-that-keeps-it-honest">The rule that keeps it honest</h2>
<p>Here&rsquo;s the line I won&rsquo;t cross, and it&rsquo;s the whole reason I trust the result:
<strong>the AI writes the prose, but it never decides a fact.</strong></p>
<ul>
<li>Which votes belong to one bill? Deterministic — parsed from the bill number.</li>
<li>Did it pass? Deterministic — read from the result row.</li>
<li>Which parties voted for, against, abstained? Deterministic — tallied from
the per-MP record, shown as <em>Za: SMER-SD, HLAS-SD, SNS · Zdržali sa: PS, KDH,
SaS</em>.</li>
</ul>
<p>The model only touches language: the headline, the summary, the question. If
it hallucinates, you get an awkward sentence — never a wrong vote count. And
if the model fails entirely, the card falls back to the raw title. The facts
come from the record; the model just makes the record legible. For civic data,
that separation isn&rsquo;t a nice-to-have — it&rsquo;s the difference between a tool and a
liability. (Every card says so out loud: <em>summaries are AI-generated; the raw
record prevails.</em>)</p>
<h2 id="the-part-that-closes-the-loop">The part that closes the loop</h2>
<p>Showing people how their representatives voted is only half a feedback loop.
The other half is letting them answer.</p>
<p>Each decision carries its one distilled question and two buttons — <strong>Áno / Nie</strong>.
You vote, and the site shows the citizen tally <em>next to</em> how parliament
actually decided, with the honest verdict on top: <em>&quot;✓ Citizens and Parliament
agree&quot;</em> or <em>&quot;⚖ Gap between citizens and Parliament.&quot;</em> That gap is the entire
point. It&rsquo;s the thesis behind a side project of mine called
<a href="https://veracracy.hippotion.com">veracracy</a> — governance measured against
verified knowledge and the actual will of the governed — made concrete enough
to click.</p>
<figure>
    <img loading="lazy" src="eu-overview.png"
         alt="VoteWatch EU overview mode"/> <figcaption>
            <p>The same loop on the European Parliament — dossiers consolidated, political-group stances (EPP, S&amp;D, PfE…), and the citizen poll under each topic.</p>
        </figcaption>
</figure>

<p>The backend is deliberately boring. The site is static (git-synced nginx,
same as this blog). Votes can&rsquo;t POST to a static page, so they go to a public
<a href="https://n8n.hippotion.com">n8n</a> webhook that records to a data table and
returns live tallies — no new service, no database, just the automation box I
already run. Vote keys are namespaced so EU and Slovak polls share one store
without colliding.</p>
<h2 id="the-honest-caveat">The honest caveat</h2>
<p>Dedup is browser-local. It stops casual double-voting, but behind a Cloudflare
tunnel every request shares one IP, so this is an <strong>indicative signal, not a
secured ballot</strong>. That&rsquo;s the right altitude for &ldquo;let people express an
opinion.&rdquo; The day it needs to mean more than that, it needs real identity
first — and I&rsquo;d rather ship the honest version than fake the robust one.</p>
<p>It&rsquo;s live at <a href="https://votewatch.hippotion.com">votewatch.hippotion.com</a> — the
EU parliament and the Slovak NR SR, every MEP and every poslanec, in plain
language, with a button that asks the only question that matters after a vote:
<strong>would you have voted the same way?</strong></p>
<p>A neutral record — what was decided and who decided it — not a villain list.
Data © <a href="https://howtheyvote.eu">HowTheyVote.eu</a> (ODbL) and <code>nrsr.sk</code>.</p>
]]></content:encoded></item><item><title>Veracracy: The Question We Forget to Ask When We Govern</title><link>https://blog.hippotion.com/posts/veracracy/</link><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/veracracy/</guid><description>I built a clock that counts down to a form of government that doesn&amp;rsquo;t exist yet — legitimacy grounded in verified knowledge rather than power, wealth, or whoever shouts loudest. The only reason I&amp;rsquo;m not embarrassed to have built it: the clock can run backward, the assumption behind it is published in plain sight, and the first concrete brick already ships real parliamentary data. A measurement, not a prophecy.</description><content:encoded><![CDATA[<h2 id="the-missing-question">The missing question</h2>
<p>Democracies ask <em>what do we want?</em> Markets ask <em>what will we pay?</em> Both are
good questions, and between them they run most of the world. But there&rsquo;s a
third question that almost no system asks before it acts, and it&rsquo;s the one that
decides whether the first two produce anything good:</p>
<p><em>What do we actually know — and how do we know it?</em></p>
<p>I gave the idea of governing as if that question came first a name —
<strong>veracracy</strong>, from <em>veritas</em> (truth) and <em>kratos</em> (rule). Not rule <em>by</em>
experts in a back room. Rule by <strong>evidence that anyone can inspect</strong>,
deliberated by the people it binds. It lives at
<a href="https://veracracy.hippotion.com">veracracy.hippotion.com</a>, and this is the
honest account of what it is and why an infrastructure engineer ended up
building a shrine to an idea.</p>
<figure>
    <img loading="lazy" src="clock.png"
         alt="The veracracy clock at veracracy.hippotion.com"/> <figcaption>
            <p>The clock reads ~2051 — computed, not wished, from one published assumption. The sun&rsquo;s height above the horizon is how far the weighted dials have risen; the tag tracks the beacon (Taiwan) against the world.</p>
        </figcaption>
</figure>

<h2 id="what-it-actually-means">What it actually means</h2>
<p>Strip the romance and veracracy is five fairly concrete commitments:</p>
<ul>
<li><strong>Truth as civic infrastructure.</strong> Verified, open evidence maintained like
roads and water — with provenance, versioning, and repair crews. A public
utility, not a content feed.</li>
<li><strong>Radical transparency.</strong> Binding decisions carry their evidence trail by
default. <em>A law without its sources is a claim, not a law.</em></li>
<li><strong>Decentralised trust.</strong> No ministry of truth. Verification is plural,
adversarial, and bridging — many checkers, no single owner, consensus that
has to span camps to count.</li>
<li><strong>Ethical AI as auditor and advocate.</strong> Machines that trace claims, surface
contradictions, and argue <em>against</em> the powerful reading of the data — never
as oracle, always as instrument.</li>
<li><strong>Participatory epistemocracy.</strong> Citizens not as voters once every four
years, but as standing jurors of what is true enough to act on, where weight
accrues to evidence rather than volume.</li>
</ul>
<p>If you squint, none of that is a political program. It&rsquo;s the same instinct I
bring to a cluster — <em>provenance, versioning, a reconciler, no single point of
trust</em> — pointed at the question of how a society decides what&rsquo;s real. That&rsquo;s
the only way I know how to think, so that&rsquo;s the lens I used.</p>
<h2 id="why-a-clock-and-why-it-can-run-backward">Why a clock, and why it can run backward</h2>
<p>Here&rsquo;s where most &ldquo;vision&rdquo; projects lose me, and where I tried not to lose
myself. A manifesto is cheap. Anyone can declare a better world and feel
moral. So instead of a manifesto, the site is a <strong>measurement</strong>.</p>
<p>It shows a single year — <em>first light</em>, the year the first place on Earth
might plausibly govern this way. Right now it reads around <strong>2051</strong>. But that
number isn&rsquo;t a wish; it&rsquo;s computed, from one assumption stated in plain sight:
<em>the infrastructure of verified governance has been built since the world went
online in 1991, and continues at the average pace it has held since.</em> A set of
dials — open data, civic tech, verification at platform scale — each scored
0–1 from a named source, weighted, extrapolated. Change the assumption and the
number moves: pace the same dials from Athens in 508 BC instead, and dawn lands
near the year 3860. So the assumption is the lever, which is exactly why it&rsquo;s
published.</p>
<p>And the clock can run <em>backward</em>. There&rsquo;s a Watch — a standing log of what
moved the year — that deliberately files the evidence <em>against</em>: every
transparency rollback, every deliberation experiment that failed, every force
pushing dawn further out. Because <strong>a sunrise that only ever gets closer is a
marketing widget.</strong> The honesty is the product. If I can&rsquo;t show you the thing
that would move the number the wrong way, you shouldn&rsquo;t believe the number.</p>
<h2 id="the-first-brick">The first brick</h2>
<p>Ideas this size are easy to admire and easy to never touch. So the rule I set
myself is that veracracy has to cash out in things that actually run.</p>
<p>The first one shipped this week:
<a href="https://blog.hippotion.com/posts/votewatch/">VoteWatch</a> — every roll-call vote
in the European Parliament and the Slovak National Council, scraped from the
public record, distilled into plain language, showing which party voted which
way, with a button that asks you whether you&rsquo;d have voted the same. It&rsquo;s the
third and fifth pillars made clickable: <em>binding decisions carrying their
evidence trail</em>, and <em>citizens as standing jurors</em> rather than spectators. The
gap it surfaces — between how parliament voted and how the people who answered
would have — is veracracy in miniature, on real data, today.</p>
<p>It&rsquo;s small. It&rsquo;s one person&rsquo;s homelab. The voting is an indicative signal, not
a secured ballot, and I say so on the page. But it&rsquo;s the difference between a
belief and a brick, and I would rather lay one honest brick than write a
beautiful manifesto.</p>
<h2 id="where-this-comes-from">Where this comes from</h2>
<p>I&rsquo;ll be straight about the shape of this: it&rsquo;s idealistic, it&rsquo;s personal, and
I don&rsquo;t expect to see first light. I&rsquo;m a solo operator in a small town who runs
a rack of servers and thinks too much about how systems stay trustworthy when
no one&rsquo;s watching them. Veracracy is what happened when that instinct refused
to stay inside the server room.</p>
<p>The version of this I can defend isn&rsquo;t the dream — it&rsquo;s the discipline around
the dream. Publish your assumption. Let the clock run backward. Cash the idea
out in something real. Credit your sources. Ship the honest version, not the
robust-sounding one.</p>
<p>A measurement with one stated assumption — not a prophecy. The clock&rsquo;s at
<a href="https://veracracy.hippotion.com">veracracy.hippotion.com</a>; disagree with a
dial and you&rsquo;ve understood the point.</p>
]]></content:encoded></item><item><title>I Run GitOps for My Brain</title><link>https://blog.hippotion.com/posts/gitops-for-my-brain/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/gitops-for-my-brain/</guid><description>An AI agent on a scheduled idle walk through my notes pointed out that I&amp;rsquo;d built the same architecture three times — at work, in my homelab, and in my second brain — and that the third copy was missing the part that makes GitOps work. It was right. So we shipped the missing piece the same day.</description><content:encoded><![CDATA[<h2 id="the-pattern-i-didnt-know-i-had">The pattern I didn&rsquo;t know I had</h2>
<p>This week an AI agent told me something about my own systems that I&rsquo;d never
noticed, and it was correct: I have one favorite architecture, and I&rsquo;ve built
it three times.</p>
<ul>
<li><strong>At work</strong>: git holds Terraform code → Terraform derives the S3 buckets.
Nobody clicks around in the AWS console; the repo is the truth.</li>
<li><strong>In the homelab</strong>: git holds Kubernetes manifests → ArgoCD derives the
cluster. Every app on my rack is a folder in a repo.</li>
<li><strong>In my second brain</strong>: a vault of markdown notes → an indexer derives the
search database (SQLite FTS + a link graph) that my AI tools query.</li>
</ul>
<p>Same shape everywhere: a plain-text source of truth in git, and a machine that
builds the real thing from it. Master copy, derived state. I never decided
this consciously — it&rsquo;s just how my hands build things now.</p>
<h2 id="gitops-isnt-the-git-part">GitOps isn&rsquo;t the git part</h2>
<p>Here&rsquo;s the thing that the third copy got wrong, and it took me embarrassingly
long to see because I <em>teach</em> this pattern at the infrastructure layer.</p>
<p>&ldquo;Configuration in git&rdquo; existed long before GitOps. What made GitOps an actual
shift was the <strong>reconciler</strong>: ArgoCD doesn&rsquo;t apply your manifests once and
wish you luck. It watches, continuously. When the cluster drifts from the
repo, you get an <code>OutOfSync</code> badge, and with <code>selfHeal</code> enabled it puts
reality back where the repo says it should be. The loop is the product. Git
is just where the loop points.</p>
<p>My vault had no loop. If I edited a note and forgot to rebuild the index, the
search results my AI agents rely on were silently stale — no badge, no error,
nothing. The only protection was a rule in the repo&rsquo;s agent instructions:
<em>&ldquo;if files and index disagree, the files win — run the indexer.&rdquo;</em></p>
<p>A policy that agents must remember. In other words: I was running Kubernetes
with a sticky note on the monitor that says <em>please redeploy after editing
the YAML</em>. I would never accept that on my cluster. My brain ran on it for
months.</p>
<h2 id="the-fix-took-an-afternoon">The fix took an afternoon</h2>
<p>Two pieces, both boring on purpose.</p>
<p><strong><code>exo status</code></strong> — the OutOfSync badge. The indexer now stores a content hash
per note; <code>status</code> re-hashes the vault and diffs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;status&#34;</span><span class="p">:</span> <span class="s2">&#34;OutOfSync&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;modified&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;vault/10-notes/interests-themes.md&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;new&#34;</span><span class="p">:</span> <span class="p">[],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;deleted&#34;</span><span class="p">:</span> <span class="p">[],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;repair&#34;</span><span class="p">:</span> <span class="s2">&#34;exo index&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Exit code 0 when synced, 1 when not — so scripts and CI can ask the question
too, exactly like <code>argocd app get</code>.</p>
<p><strong>Git hooks</strong> — the selfHeal. Versioned hooks (<code>core.hooksPath .githooks</code>) on
<code>post-commit</code> and <code>post-merge</code> rebuild the index after every commit and pull:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl"><span class="nb">command</span> -v exo &gt;/dev/null 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="o">||</span> <span class="nb">exit</span> <span class="m">0</span>
</span></span><span class="line"><span class="cl"><span class="nv">EXO_ROOT</span><span class="o">=</span><span class="s2">&#34;</span><span class="k">$(</span>git rev-parse --show-toplevel<span class="k">)</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">exo index &gt;/dev/null 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="o">&amp;&amp;</span> <span class="nb">echo</span> <span class="s2">&#34;exo: index reconciled (Synced)&#34;</span>
</span></span></code></pre></div><p>Now every <code>git commit</code> in the vault prints <code>exo: index reconciled (Synced)</code>
on its way out. The rule didn&rsquo;t change — <em>files win</em> — but it stopped being
something agents must remember and became something a machine enforces.
That&rsquo;s the entire difference between configuration management and GitOps,
replayed at the knowledge layer.</p>
<h2 id="the-part-where-it-gets-a-little-strange">The part where it gets a little strange</h2>
<p>The reason I&rsquo;m writing this post at all: I didn&rsquo;t have this idea. A scheduled
agent did, on what I can only describe as an idle walk.</p>
<p>My vault has a weekly cron job — we call it the Wanderer — that samples pairs
of notes that are <em>far apart</em>: different folders, different months, almost no
shared vocabulary. A headless Claude gets the pairs with exactly one task:
<em>read both notes in full and say whether anything genuinely connects. &ldquo;Nothing
connects&rdquo; is a successful run.</em> That last sentence is load-bearing — the run
always reports its result either way, so the agent never needs to manufacture
a finding to have done its job.</p>
<p>On its very first walk, it collided a work note about Terraform-driven S3
provisioning with the architecture map of the vault itself, and wrote: <em>same
sentence in different clothes — and the brain copy is missing its
reconciler.</em> Then it listed the two fixes you just read about.</p>
<p>Retrieval answers the questions you ask. Distant collisions surface the
questions you didn&rsquo;t know you had. It turns out my second brain didn&rsquo;t need
to get better at remembering — it needed to occasionally interrupt me.</p>
<h2 id="if-you-keep-a-vault">If you keep a vault</h2>
<p>Whatever your stack — Obsidian, org-mode, a folder of markdown — if anything
<em>derives</em> from your notes (an index, embeddings, a published site), then you
have source of truth and derived state, and the GitOps question applies: <strong>who
notices when they drift?</strong> If the answer is &ldquo;I do, hopefully,&rdquo; you&rsquo;re running
the sticky-note era. Give it a badge and a loop. It&rsquo;s an afternoon.</p>
]]></content:encoded></item><item><title>🚩 I Built a Usage Dashboard and Tripped Claude Fable 5's Safety Net</title><link>https://blog.hippotion.com/posts/when-claude-flagged-my-own-dashboard/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/when-claude-flagged-my-own-dashboard/</guid><description>I asked Claude Fable 5 to help me self-host a dashboard for my own Claude usage. Halfway through, its dual-use safety measures flagged the conversation and downshifted me to Opus 4.8. Nothing I did was wrong — the request just had the shape of something that is. That gap, between what a thing looks like and what it&amp;rsquo;s for, turns out to be the whole story.</description><content:encoded><![CDATA[<h2 id="the-thing-i-was-actually-building">The thing I was actually building</h2>
<p>I wanted a small web page on my homelab that shows my Claude usage — the 5-hour
session window, the weekly limits, the per-model split. There&rsquo;s a nice Electron
widget out there that does this on the desktop, but I don&rsquo;t want a desktop app; I
want a URL behind my own OAuth that I can glance at from my phone.</p>
<p>The mechanics are unremarkable. The claude.ai web app reads those numbers from a
couple of undocumented endpoints using your logged-in session cookie. So a
self-hosted version does the same thing server-side: hold the session token as a
secret, replay the same calls, cache the result, render some bars. An afternoon&rsquo;s
work. I was pairing with <strong>Claude Fable 5</strong> on it — Anthropic&rsquo;s newest model, and
the one that ships with extra safety measures around dual-use capability.</p>
<p>Then, partway through, I got the message: <em>Fable 5 flagged something in this
session and switched to a more conservative model.</em> It dropped me to <strong>Opus 4.8</strong>
for the rest of the conversation. Safe conversations sometimes trip it, the notice
said. Send feedback.</p>
<h2 id="i-wasnt-doing-anything-wrong-thats-the-interesting-part">I wasn&rsquo;t doing anything wrong. That&rsquo;s the interesting part.</h2>
<p>My first reaction was the obvious one — <em>what did I say?</em> But I knew exactly what
I&rsquo;d built, and none of it was sketchy. It was my account, my usage data, my
hardware, my OAuth in front of it.</p>
<p>So I went looking at the request the way a classifier would — not &ldquo;what did he
mean&rdquo; but &ldquo;what does this look like.&rdquo; And from that angle it&rsquo;s a different
picture entirely. Stack up the surface features:</p>
<ul>
<li>🔑 capturing a <strong>session token</strong> and storing it to replay later</li>
<li>🌐 sending it to an <strong>undocumented API</strong> that isn&rsquo;t meant for third parties</li>
<li>🕵️ spoofing a <strong>browser User-Agent</strong> so the request blends in</li>
<li>🧱 detecting and working around a <strong>Cloudflare bot challenge</strong></li>
</ul>
<p>Read that list cold, with no context. That&rsquo;s not a usage dashboard. That&rsquo;s the
exact signature of credential theft and scraping tooling. Every individual move
is one a malicious script would also make. The only thing separating my afternoon
project from the bad version is <em>whose</em> account it touches and <em>why</em> — and intent
is precisely the part that doesn&rsquo;t show up in the tokens.</p>
<h2 id="surface-vs-intent">Surface vs. intent</h2>
<p>This is the part worth sitting with, because it&rsquo;s not a Claude quirk — it&rsquo;s the
shape of every content classifier, every WAF rule, every fraud model I&rsquo;ve ever
run in production.</p>
<p>A detector scores what it can see. It cannot see intent; it sees features. And
the features of &ldquo;monitor my own usage&rdquo; and &ldquo;harvest someone else&rsquo;s session&rdquo;
overlap almost completely, because the <em>technique</em> is identical — the difference
lives entirely in context the model has been deliberately built not to over-trust.
You can&rsquo;t tune that gap away. You can only pick where to sit on the
precision/recall curve, and Fable 5 — being the high-capability model with the
extra dual-use measures bolted on — sits where it catches the pattern even when it
costs some false positives, then hands off to Opus 4.8. I was the false positive.
The system did roughly the right thing for roughly the right reason; it just
doesn&rsquo;t feel that way when it&rsquo;s pointed at you.</p>
<p>The honest engineering takeaway is the one I keep relearning: <strong>if a benign task
has the silhouette of an abusive one, expect to get treated like the silhouette.</strong>
Not just by AI — by rate limiters, by bot detection, by the fraud team. The fix
isn&rsquo;t to be offended. It&rsquo;s to recognize the silhouette, and where it matters,
make the legitimate context legible up front.</p>
<h2 id="what-id-do-differently">What I&rsquo;d do differently</h2>
<p>Practically, very little — the project was fine, and it downshifted to a model
that finished the job. But the framing changed how I built it. I leaned harder
into the parts that make intent <em>visible in the design</em>: the session token never
leaves the server, it lives in Vault and arrives as an injected secret, the whole
thing sits behind OAuth, and it polls on a leash instead of hammering. Not because
a classifier made me, but because those are the same choices that make it
obviously a personal dashboard and not a harvesting bot — to a reviewer, to
future-me, and yes, to a model reading over my shoulder.</p>
<p>The widget rides your credential on your desktop. Mine keeps it server-side behind
my own front door. Turns out building it the trustworthy way and building it the
<em>legibly</em> trustworthy way are the same work — and getting flagged is what made me
notice the difference.</p>
]]></content:encoded></item><item><title>🎙️ Cloning My Own Voice for My Kid's Audiobooks</title><link>https://blog.hippotion.com/posts/clone-your-voice-hungarian-audiobooks/</link><pubDate>Fri, 13 Mar 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/clone-your-voice-hungarian-audiobooks/</guid><description>Zero-shot voice cloning with XTTS-v2 on a CPU-only k3s node: 26 seconds of phone audio in, a cloned-voice audiobook out — and an honest verdict from the bedtime jury. Every manual step, including the ones that went wrong.</description><content:encoded><![CDATA[<h2 id="the-problem-nobody-sells-a-fix-for">The problem nobody sells a fix for</h2>
<p>My kid loves audiobooks. The commercial platforms barely carry Hungarian
children&rsquo;s books, and none of them carry the one narrator my kid actually
prefers: me. I can&rsquo;t read aloud every evening — but my homelab doesn&rsquo;t have
that excuse.</p>
<p>The platform half (ebook → M4B → Audiobookshelf on k3s) is a story for
another post. This one is about the voice: how to go from a phone recording
to an audiobook narrated in your own voice, step by step, on hardware with
no GPU.</p>
<p>The short version: <strong>XTTS-v2 does zero-shot voice cloning from a ~20-second
sample.</strong> No training, no fine-tuning, no dataset. One clean recording and a
flag.</p>
<hr>
<h2 id="why-xtts-v2-in-2026">Why XTTS-v2, in 2026?</h2>
<p>It&rsquo;s not the best open TTS model anymore. Chatterbox beats ElevenLabs in
blind tests; F5-TTS sounds cleaner. But model selection for a small language
is constraint-first, not leaderboard-first: Chatterbox has <strong>no Hungarian</strong>,
NVIDIA&rsquo;s TTS NIMs have <strong>no Hungarian</strong>, Kokoro — no Hungarian. XTTS-v2
speaks Hungarian <em>and</em> clones voices <em>and</em> runs on CPU. That intersection
has exactly one resident.</p>
<p>I run it via <a href="https://github.com/DrewThomasson/ebook2audiobook">ebook2audiobook</a>,
which wraps XTTS with Calibre ingestion and M4B chaptering.</p>
<hr>
<h2 id="step-1--record-25-seconds-of-yourself">Step 1 — Record ~25 seconds of yourself</h2>
<p>Phone voice-memo app, quiet room, ~20 cm from your mouth. Mine came out as
28 seconds of stereo 48 kHz AAC. Two rules that matter more than gear:</p>
<ul>
<li><strong>Read the way you want the books narrated.</strong> The clone copies prosody —
energy, pacing, warmth — not just timbre. A flat recital clones into a
flat narrator. I read a children&rsquo;s tale the way I&rsquo;d read it at bedtime.</li>
<li><strong>Don&rsquo;t peak the mic.</strong> My sample hit −0.1 dB max volume — right at the
clipping ceiling. It worked, but quieter is safer. Check yours:</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -i janos.m4a -af volumedetect -f null - 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> grep volume
</span></span><span class="line"><span class="cl"><span class="c1"># mean_volume: -21.4 dB   ← fine</span>
</span></span><span class="line"><span class="cl"><span class="c1"># max_volume:  -0.1 dB    ← living dangerously</span>
</span></span></code></pre></div><hr>
<h2 id="step-2--normalize-to-what-xtts-wants">Step 2 — Normalize to what XTTS wants</h2>
<p>XTTS expects a mono WAV; 24 kHz matches its internal rate. Trim the silence
off both ends while you&rsquo;re at it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -i janos.m4a <span class="se">\
</span></span></span><span class="line"><span class="cl">  -af <span class="s2">&#34;silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
</span></span></span><span class="line"><span class="cl"><span class="s2">areverse,silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
</span></span></span><span class="line"><span class="cl"><span class="s2">areverse&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -ar <span class="m">24000</span> -ac <span class="m">1</span> janos.wav
</span></span></code></pre></div><p>(The double-<code>areverse</code> is the classic trick: <code>silenceremove</code> only trims the
front, so you flip the audio, trim the front again, flip it back.)</p>
<p>Drop the result where your TTS stack looks for voices. In ebook2audiobook
that&rsquo;s the <code>voices/</code> tree, organised by language:</p>
<pre tabindex="0"><code>voices/hun/adult/male/janos.wav
</code></pre><hr>
<h2 id="step-3--synthesize">Step 3 — Synthesize</h2>
<p>One flag does the cloning. Headless run on the k3s pod:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl <span class="nb">exec</span> -n web-audiobooks deploy/ebook2audiobook -- sh -c <span class="se">\
</span></span></span><span class="line"><span class="cl">  <span class="s1">&#39;cd /app &amp;&amp; python app.py --headless \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --ebook &#34;/app/ebooks/tale.txt&#34; \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --language hun \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --tts_engine xtts \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --device cpu \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --voice /app/voices/hun/adult/male/janos.wav \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --output_format m4b \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --output_dir /app/audiobooks&#39;</span>
</span></span></code></pre></div><p>On my 12-core CPU node this runs at roughly 3× real-time — a 2-minute tale
takes ~8 minutes, a full children&rsquo;s book is an overnight job. The first run
computes speaker latents from your WAV; after that it&rsquo;s ordinary synthesis
with your voice as the reference.</p>
<hr>
<h2 id="step-4--ab-before-you-batch">Step 4 — A/B before you batch</h2>
<p>Render one <em>short</em> book twice — stock narrator and cloned voice — and put
both in front of the household jury. Cloning quality is personal in the most
literal sense: MOS scores won&rsquo;t tell you whether it sounds like <em>you</em>. My
benchmark has strong opinions and goes to bed at eight.</p>
<p>Only after the clone passes do you re-render the library with <code>--voice</code>.</p>
<p><img alt="Audiobookshelf library with the same tale twice: stock narrator and the &ldquo;apa hangján&rdquo; clone, side by side for the jury" loading="lazy" src="/posts/clone-your-voice-hungarian-audiobooks/abs-ab.png"></p>
<hr>
<h2 id="the-manual-steps-that-earn-the-word-manual">The manual steps that earn the word &ldquo;manual&rdquo;</h2>
<p>Things the tutorials skip, learned the slow way:</p>
<ul>
<li><strong>Long conversions die with the browser tab.</strong> Gradio-style web UIs tie
the job to the open page; close the laptop and you get &ldquo;Conversion
cancelled&rdquo; half a book in. Anything longer than ~15 minutes of audio runs
headless under <code>nohup</code>.</li>
<li><strong>CPU synthesis leaks memory over hours.</strong> My pod has a hard 6 Gi limit on
a 16 Gi node, and a 6-hour run will hit it. Keep the cap (it protects the
other 30 namespaces), and rely on the tool&rsquo;s <code>--session &lt;id&gt;</code> resume — it
picks up at the exact sentence. One catch: headless resume still asks an
interactive <code>Resume? [y]es</code> — pipe <code>echo y |</code> into it.</li>
<li><strong>The per-chapter FLACs survive a crash.</strong> If the final M4B muxing step
OOMs, don&rsquo;t re-synthesize: the chapters are sitting in the session&rsquo;s tmp
directory, and <code>ffmpeg</code> will assemble them into a chaptered M4B with a
hand-written FFMETADATA file in about two minutes, at near-zero memory.</li>
</ul>
<p>None of this is hard. It&rsquo;s just undocumented — which is the gap between
&ldquo;there&rsquo;s a model for that&rdquo; and your kid pressing play.</p>
<hr>
<h2 id="postscript-the-jury-came-back">Postscript: the jury came back</h2>
<p>The clone failed. Recognizably my timbre, nowhere near natural — I wouldn&rsquo;t
play it to my kid, which is the only metric that exists for this project.</p>
<p>Worth being precise about <em>what</em> failed: the stock XTTS-v2 narrator passed
the ear test and the library keeps growing with it. Zero-shot <strong>cloning</strong> is
the part that fell short — a 2023 model conditioning on 26 seconds of a
voice it has never seen, in a language that was never its strong suit. The
pipeline above is still the right pipeline; the model isn&rsquo;t there yet on
CPU-class options.</p>
<p>The next experiment is already picked: <a href="https://huggingface.co/Maxdorger29/f5-tts-hungarian">F5-TTS Hungarian</a>,
a 2026 fine-tune on 280 hours of actual Hungarian speech, built precisely
for short-sample cloning. It needs CUDA, which my node doesn&rsquo;t have — but a
rented spot GPU tests it for the price of an espresso. If it passes the
bedtime jury, that&rsquo;ll be its own post.</p>
<p>Negative results are results. The jury reconvenes when the GPU shows up.</p>
]]></content:encoded></item><item><title>🌱 My Second Brain Weeds Itself Now</title><link>https://blog.hippotion.com/posts/an-ai-gardener-for-your-second-brain/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/an-ai-gardener-for-your-second-brain/</guid><description>I gave my markdown knowledge base a nightly gardener — an AI that finds orphan notes and missing links and fixes them, every change a reviewable git commit. The fun part was the Kubernetes wall I hit on the way.</description><content:encoded><![CDATA[<p>A few weeks ago I <a href="/posts/a-second-brain-you-can-git-clone/">rebuilt my second brain as a folder of markdown in git</a> — vault is the source of truth, everything else (search index, graph, 3D viewer) is a derived layer I can delete and rebuild. I love it. But a knowledge base has a dirty secret: <strong>it rots.</strong></p>
<p>Not the files — those are fine. The <em>connections</em> rot. You capture a note at 11pm and never link it to anything, so it becomes an orphan floating off the graph. A project note&rsquo;s one-line summary describes what the project was three weeks ago. Two notes are obviously about the same thing and neither knows the other exists. Do this for a few months and you don&rsquo;t have a second brain, you have a junk drawer with good search.</p>
<p>The honest fix is to weed the garden regularly. The honest truth is that nobody does, including me.</p>
<p>So I stopped relying on myself and built a gardener.</p>
<h2 id="what-it-actually-does">What it actually does</h2>
<p>Every night at 3am, on my homelab box, a script runs:</p>
<ol>
<li><strong>Detect</strong> — <code>exo garden</code>, a plain query over the index, produces a report: here are the orphans, here are notes that should probably link to each other, here are summaries that look stale. <strong>No AI in this step.</strong> It&rsquo;s SQL and graph traversal. Deterministic, boring, trustworthy.</li>
<li><strong>Decide and write</strong> — that report gets piped to <code>claude -p</code> (Claude Code in headless mode). Claude reads the vault&rsquo;s operating contract, makes <em>only high-confidence</em> edits — add a <code>[[wikilink]]</code> between two genuinely related notes, refresh a stale summary — caps itself at ~10 notes a night, and writes a dated log note explaining exactly what it changed and what it deliberately skipped.</li>
<li><strong>Commit</strong> — the wrapper reindexes and lands everything as a single <code>garden: 2026-06-09 …</code> git commit, then pushes. My 3D graph viewer picks it up on the next sync.</li>
</ol>
<p>The first real run, it found one orphan (<code>90-meta/README</code>), linked it into the notes it actually indexes, and then — this is the part I liked — <em>declined</em> to touch the 12 &ldquo;stale summary&rdquo; candidates because, on inspection, every one of them was already accurate. It wrote: <em>&ldquo;flagged by length, not staleness; churning them would add noise.&rdquo;</em> A gardener that knows when <strong>not</strong> to prune is the one you can leave alone.</p>
<h2 id="isnt-this-a-solved-problem">&ldquo;Isn&rsquo;t this a solved problem?&rdquo;</h2>
<p>Mostly, no — but partly, yes, and I want to be straight about it. AI-assisted note-linking exists: Obsidian plugins like Smart Connections suggest related notes, and apps like Mem and Reflect auto-organize as you write. They&rsquo;re good.</p>
<p>Three things make this different enough to build:</p>
<ul>
<li><strong>Every change is a reviewable git diff, authored by a named agent.</strong> Not silent magic that rearranges your notes while you&rsquo;re not looking. <code>git log -p</code> shows you exactly what the gardener did last night; <code>git revert</code> undoes a bad night in one command. For something as personal as a knowledge base, &ldquo;show me the diff&rdquo; beats &ldquo;trust me.&rdquo;</li>
<li><strong>It&rsquo;s mine, end to end.</strong> Runs on my hardware, on my schedule, with a model I point at. No SaaS holds my brain hostage.</li>
<li><strong>The detection is deterministic; the model only acts.</strong> The LLM never decides <em>what&rsquo;s wrong</em> — a boring query does that. The model only decides <em>how to fix the things already found</em>. That split keeps the whole thing auditable and cheap.</li>
</ul>
<p>If you already live in a tool that does this and you trust it, great. I wanted the git-diff trail and the local control.</p>
<h2 id="the-part-i-actually-want-to-tell-you-about">The part I actually want to tell you about</h2>
<p>The plan was tidy: I run n8n on the same cluster, so n8n would be the scheduler — fire nightly, <strong>SSH into the node</strong>, run the gardener. Clean, visual, one workflow.</p>
<p>n8n could not reach the node. At all. Every port: <code>ECONNREFUSED</code>.</p>
<p>This sent me down a genuinely interesting hole, because the homelab runs <strong>Cilium</strong> for networking, and Cilium has opinions about your own node that plain Kubernetes does not.</p>
<p>First instinct: a NetworkPolicy allowing egress to the node&rsquo;s IP. Wrote it, synced it, still refused. The reason is a Cilium subtlety worth knowing: <strong>the node isn&rsquo;t a CIDR, it&rsquo;s an identity.</strong> Cilium classifies your cluster&rsquo;s own node as the special <code>host</code> identity, and ordinary <code>ipBlock</code> CIDR rules <em>do not match it</em> unless you flip a cluster-wide setting (<code>policy-cidr-match-mode: nodes</code>). My <code>192.168.0.109/32</code> rule was a no-op.</p>
<p>So I switched to the Cilium-native tool: a <code>CiliumNetworkPolicy</code> with <code>toEntities: [host]</code>. Confirmed it applied — I could see <code>reserved:host</code> allowed right there in the datapath&rsquo;s BPF policy map. I confirmed the node&rsquo;s IP really does resolve to identity <code>1</code> (host). I confirmed the host firewall was <em>disabled</em>. Everything said &ldquo;allowed.&rdquo;</p>
<p>Still <code>ECONNREFUSED</code>.</p>
<p>That&rsquo;s the wall. The packet leaves the pod with Cilium&rsquo;s blessing, hits the host&rsquo;s own network stack, and <em>something there</em> sends a reset — and I couldn&rsquo;t see what, because inspecting the host firewall needs root, and this automation deliberately doesn&rsquo;t have it. I could have kept digging with a password. But I stopped and asked a better question: <strong>why am I making a pod reach back into the host it&rsquo;s running on at all?</strong></p>
<p>That&rsquo;s an awkward direction. The work has to happen <em>on</em> the host (that&rsquo;s where the vault, git creds, and Claude live). A pod straining to SSH into its own node is fighting the grain of the platform.</p>
<p>So I inverted it. <strong>The node schedules itself</strong> — a plain cron entry, rock-solid, no network gymnastics. And n8n, instead of <em>triggering</em> the job, <em>receives</em> it: at the end of each run the node POSTs a summary to an n8n webhook. Node→n8n works perfectly (it&rsquo;s just an outbound HTTPS call to a URL). n8n keeps the run history and is the place I&rsquo;ll later wire a phone notification.</p>
<p>I lost nothing that mattered. n8n is still my dashboard; the schedule just lives where the work lives. And I deleted the SSH key and the network-policy hole I&rsquo;d opened — the cleanup felt better than the original plan would have.</p>
<h2 id="the-lesson-such-as-it-is">The lesson, such as it is</h2>
<p>Two, actually.</p>
<p><strong>One:</strong> when you&rsquo;re automating something to run unattended, the bug you want to find is the one that shows up in a <em>dry run at 2pm</em>, not at <em>3am three weeks from now</em>. I almost shipped a version where a brand-new note (untracked by git) was invisible to my change-detection and would&rsquo;ve been silently wiped each night. The dry run caught it. Always build the dry run.</p>
<p><strong>Two, the bigger one:</strong> I spent an hour trying to make a pod punch into its host because that was <em>my</em> plan, and the platform kept saying no in increasingly specific ways. The fix wasn&rsquo;t a cleverer NetworkPolicy. It was noticing I was pushing against the design and turning around. The node scheduling itself and <em>reporting up</em> to n8n is simpler, safer, and more honest about where the work actually lives.</p>
<p>My brain weeds itself now. Every morning there&rsquo;s maybe one small, sensible commit waiting — a link I&rsquo;d have never made, a summary nudged back to true — and I can read exactly what changed before my coffee&rsquo;s done. That&rsquo;s the whole dream of a second brain that isn&rsquo;t a junk drawer: it stays a garden, and I barely have to touch it.</p>
]]></content:encoded></item><item><title>🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n</title><link>https://blog.hippotion.com/posts/ats-job-poller/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ats-job-poller/</guid><description>You don&amp;rsquo;t have to be job-hunting to want to know your market — what&amp;rsquo;s out there, what it pays, where you&amp;rsquo;d fit. So I built an n8n workflow: it polls the public ATS APIs (Greenhouse/Lever/Ashby) plus a broad remote-jobs feed, filters for remote-EU infra roles, scores each posting against my CV with an LLM, and emails me only the 80%+ matches. No database, no scraping.</description><content:encoded><![CDATA[<p>You don&rsquo;t have to be about to change jobs to want to know the landscape. What&rsquo;s being built, what it
pays, where you&rsquo;d actually fit — staying current on the market (and your own worth) is just good
professional hygiene. The trouble is that <em>checking</em> is tedious, so most of us don&rsquo;t, until we&rsquo;re
already job-hunting and starting cold.</p>
<p>So I automated mine. An <a href="https://n8n.io">n8n</a> workflow on my homelab polls job boards every six hours,
scores each new posting against my profile with an LLM, and emails me only the strong matches — the
ones scoring 80%+. When it&rsquo;s quiet, it&rsquo;s silent. When something genuinely fits, I know the same day.
Here&rsquo;s what I learned building it. Repo at the bottom.</p>
<h2 id="three-apis-cover-most-of-the-market">Three APIs cover most of the market</h2>
<p>Company career pages look bespoke, but underneath, the vast majority run on one of three ATS — and
all three hand you the jobs as unauthenticated JSON:</p>
<ul>
<li><strong>Greenhouse</strong> — <code>boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true</code></li>
<li><strong>Lever</strong> — <code>api.lever.co/v0/postings/{token}?mode=json</code></li>
<li><strong>Ashby</strong> — <code>api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true</code></li>
</ul>
<p>No scraping, no headless browser. You poll the API the page itself calls, normalize the three
shapes into one <code>{ company, title, location, remote, url, posted_at, description, external_id }</code>, and
you&rsquo;re done with the hard part.</p>
<h2 id="resolve-the-token-is-half-the-battle">&ldquo;Resolve the token&rdquo; is half the battle</h2>
<p>The naive assumption — <em>the token is the company name, and everyone&rsquo;s on one of the three</em> — is half
right. When I probed my initial wishlist, <strong>roughly half 404&rsquo;d everywhere</strong>: HashiCorp (now under
IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They&rsquo;re on a fourth or fifth
system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as
disabled config stubs. Verify before you trust a slug.</p>
<h2 id="dedup-without-a-database">Dedup without a database</h2>
<p>I didn&rsquo;t want to stand up Postgres just to remember which jobs I&rsquo;d already seen. n8n&rsquo;s <strong>Data Tables</strong>
handle it natively: a <code>seen_jobs</code> table, an <code>external_id</code> namespaced <code>{ats}:{company}:{id}</code>, and the
<code>rowNotExists</code> operation drops anything already recorded. State lives inside n8n, backed up with it.
Zero extra infrastructure.</p>
<p>The ordering matters: <strong>notify first, mark seen second.</strong> The insert only happens after the email
sends, so a failed send retries next run instead of silently swallowing a posting.</p>
<h2 id="the-location-filter-is-a-trap">The location filter is a trap</h2>
<p>My first version kept everything that wasn&rsquo;t explicitly US-based. The inbox filled with <em>&ldquo;Senior
Platform Engineer — Spain (Remote)&rdquo;</em> and <em>&quot;… — United Kingdom (Remote)&quot;</em>. Those aren&rsquo;t remote-for-me
— they&rsquo;re remote <em>if you live in Spain</em>. Useless from where I sit.</p>
<p>The fix was to invert the logic. Keep only three things:</p>
<ul>
<li>globally-remote / worldwide / anywhere,</li>
<li>pan-EU (EMEA / Europe / EU / EEA),</li>
<li>my own country.</li>
</ul>
<p>…and <strong>drop single-country remote</strong>, even EU ones. Region and home matches win over the country
deny-list, ambiguous locations are kept (a missed match is worse than one extra line to skim). That
one change cut the noise more than anything else.</p>
<h2 id="let-an-llm-read-the-actual-job">Let an LLM read the actual job</h2>
<p>Keyword + location filtering gets you a candidate list, but it can&rsquo;t tell a &ldquo;Platform Engineer&rdquo; who
herds Kubernetes from a &ldquo;Platform Engineer&rdquo; who owns a Figma design system. The job description can.</p>
<p>So the last step scores each new posting against my CV. My first version batched all of them into
<strong>one</strong> big LLM call — which promptly timed out on the free tier. The fix was the opposite: <strong>one
small call per job</strong>, which also means a single slow or rate-limited job never sinks the batch. Each
call asks a <a href="https://build.nvidia.com">NVIDIA NIM</a> model (Llama 3.1 8B, OpenAI-compatible) for one
number and a reason:</p>
<blockquote>
<p>Score this job 0–100 for fit against my profile. Return <code>{score, reason}</code>.</p>
</blockquote>
<p>That score is what lets me <strong>widen the net instead of narrowing it.</strong> On top of the curated company
list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location
filters do the first pass, then I <strong>only email the roles scoring 80%+.</strong> Casting wide is fine when a
model is the bar at the door. A line ends up looking like:</p>
<blockquote>
<p><strong>92%</strong> — <em>Grafana Labs</em> — Senior Platform Engineer (Remote, EMEA) — <em>strong k8s/GitOps overlap</em> — link</p>
</blockquote>
<p>Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen
either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.</p>
<h2 id="the-unglamorous-bits-that-make-it-trustworthy">The unglamorous bits that make it trustworthy</h2>
<ul>
<li><strong>One bad source can&rsquo;t kill the run</strong> — every fetch is wrapped; failures become a <code>⚠️ N sources failing</code> footer so a company quietly changing ATS is visible, not invisible.</li>
<li><strong>A prime run</strong> seeds the table silently the first time, so I&rsquo;m not buried under every currently-open
role on day one.</li>
<li><strong>Everything tunable lives in one Config node</strong> — companies, keywords, location lists, the profile,
the model — so adding a company is a one-line edit, not a graph safari.</li>
</ul>
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>The &ldquo;scrape job boards&rdquo; problem mostly isn&rsquo;t a scraping problem — it&rsquo;s three public APIs and a
normalizer.</li>
<li>For personal automation, reach for the boring-but-correct primitive: native dedup state beats a
database you have to operate.</li>
<li>An LLM works best here as the <strong>bar at the door</strong>: cheap deterministic filters keep the candidate
set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net
without drowning in it.</li>
</ul>
<p>Workflow JSON, the full node-by-node breakdown, and setup notes:
<strong><a href="https://github.com/janos-gyorgy/ats-job-poller">github.com/janos-gyorgy/ats-job-poller</a></strong>.</p>
]]></content:encoded></item><item><title>🧠 A Second Brain You Can `git clone`</title><link>https://blog.hippotion.com/posts/a-second-brain-you-can-git-clone/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/a-second-brain-you-can-git-clone/</guid><description>My first second brain died the way most do — on multi-device sync. The rebuild: plain markdown as the source of truth, every clever layer derived and disposable, and an AI that tends it through reviewable git diffs.</description><content:encoded><![CDATA[<h2 id="the-graveyard-of-second-brains">The graveyard of second brains</h2>
<p>I had a second brain once. Obsidian vault, a CouchDB LiveSync backend, even a
weekly agent that summarised my notes. It worked — for a while. Then the sync
started fighting itself across my laptop, the homelab, and my phone, and the day
syncing becomes a chore is the day you stop opening the thing. The notes were
still there. I just never looked at them again.</p>
<p>That&rsquo;s how most second brains die. Not from bad notes — from the <em>plumbing</em>. The
sync breaks, or the upkeep outpaces the payoff, or the whole thing is trapped in
one app&rsquo;s database and moving it feels like surgery. The knowledge was never the
problem. The container was.</p>
<p>So when I rebuilt it, I started from the failure modes, not the features.</p>
<h2 id="what-i-actually-wanted">What I actually wanted</h2>
<p>Three things, none of them &ldquo;more notes&rdquo;:</p>
<ol>
<li><strong>Memory I share with my AIs.</strong> Every time I open a fresh Claude session, it
starts from zero — I re-explain my homelab, my projects, what we decided last
week. I wanted a place both of us read <em>and</em> write, so the context survives the
session.</li>
<li><strong>Something that outlives any tool.</strong> No lock-in. If the app of the month dies,
my brain shouldn&rsquo;t die with it.</li>
<li><strong>Sync that can&rsquo;t rot.</strong> The thing that killed v1.</li>
</ol>
<h2 id="the-one-decision-that-matters">The one decision that matters</h2>
<p><strong>The store and the intelligence are different layers, and only the store is
sacred.</strong></p>
<p>The store is a folder of plain markdown in git. That&rsquo;s it. Human-readable, diffable,
greppable, yours. Everything clever sits <em>above</em> it and is fully rebuildable:</p>
<pre tabindex="0"><code>L5  Visualisation   3D graph, Obsidian, whatever reads markdown
L4  Automation      scheduled &#34;gardener&#34; runs
L3  Agent interface MCP servers — search, graph, note CRUD
L2  Index           SQLite: full-text + vectors + materialised edges
L1  Structure       typed frontmatter + [[wikilinks]]
L0  Substrate       markdown files in git   ← the only thing that&#39;s truth
</code></pre><p>Delete L1–L5 and nothing is lost — you rebuild them from L0 with one command.
That property is the whole design. The index can corrupt, the embedding model can
change, the viewer can break (mine did, spectacularly — that&rsquo;s another post), and
the knowledge doesn&rsquo;t care. It&rsquo;s text in git.</p>
<p>And <strong>sync is just <code>git pull</code>.</strong> No LiveSync daemon to wedge itself, no proprietary
replication. The exact thing that killed v1 is now the most boring, battle-tested
part of the stack. Three devices, one <code>git pull</code>, done.</p>
<h2 id="search-that-explains-itself">Search that explains itself</h2>
<p>The retrieval layer is deliberately not &ldquo;throw it all at embeddings.&rdquo; It fuses
three signals — keyword (BM25), vector similarity, and graph expansion (pull in
the neighbours of strong hits) — and every result reports <em>which signals fired</em>.</p>
<pre tabindex="0"><code>exo search &#34;hybrid retrieval&#34;
→ hybrid-retrieval   matched_on: [bm25, graph]
</code></pre><p>That <code>matched_on</code> matters more than it looks. An embeddings-only system gives you
a ranked list and no reason — you can&rsquo;t tell a real match from a vibe. For a brain
I&rsquo;m supposed to trust over years, &ldquo;why did this surface?&rdquo; is a feature, not a
nicety.</p>
<h2 id="the-ai-is-a-librarian-not-a-hoarder">The AI is a librarian, not a hoarder</h2>
<p>Here&rsquo;s the part I care about most. The AI doesn&rsquo;t just <em>read</em> the brain — it
writes to it. Through an MCP server it can search, walk the graph, and author
notes. But under a hard rule: <strong>every write is a reviewable git diff.</strong></p>
<p>It searches before it writes (extend a note, don&rsquo;t spawn a duplicate). It links
instead of piling. A scheduled &ldquo;gardener&rdquo; pass finds orphaned notes and stale
summaries and proposes fixes — as commits I can read and <code>git revert</code> if it gets
something wrong. No black-box mutation of my memory. Just a librarian that files
things while I&rsquo;m asleep and leaves a paper trail.</p>
<p>So now &ldquo;what am I building?&rdquo; is a question with an instant, honest answer: a single
map note, kept current, that every project links into. I ask, the AI pulls it, and
neither of us has to remember.</p>
<h2 id="why-not-just">Why not just…</h2>
<ul>
<li><strong>Obsidian alone?</strong> It&rsquo;s a lovely <em>viewer</em> — and I still use it as one. But it
can&rsquo;t give an agent structured read/write or explainable retrieval, and its sync
is what burned me. Here Obsidian reads the same markdown; it&rsquo;s a window, not the
house.</li>
<li><strong>Embeddings RAG?</strong> Opaque and one-directional. It can rank, but it can&rsquo;t tell
you why, and it can&rsquo;t write back. This is transparent and bidirectional.</li>
<li><strong>Notion / a SaaS brain?</strong> Lock-in by design. <code>git clone</code> is my backup and any
text editor is my fallback.</li>
<li><strong>A graph database?</strong> Unnecessary infra. The graph lives in the wikilinks; SQLite
just materialises it. I&rsquo;ll add Neo4j the day my queries actually outgrow a single
file, and not a day sooner.</li>
</ul>
<h2 id="what-it-changes">What it changes</h2>
<p>The vault is small still — that&rsquo;s fine; it grows by use. But the loop already
pays off: I work, the AI checkpoints decisions into markdown, and the <em>next</em>
session — fresh model, no memory of its own — searches the brain and is caught up
in seconds. The knowledge stopped living only in my head and in dead chat logs.</p>
<p>I&rsquo;m a team of one. There&rsquo;s no colleague who remembers why I made a call six months
ago, no handover doc someone else maintains. Continuity isn&rsquo;t a nice-to-have; it&rsquo;s
the whole job. A second brain that the AI helps keep alive — and that I can
<code>git clone</code> onto any machine in thirty seconds — is the first version of this idea
that I actually trust to still be here in five years.</p>
<p>The notes from v1? They&rsquo;re sitting in a folder, waiting to be triaged into v2. This
time I&rsquo;ll still be opening it.</p>
]]></content:encoded></item><item><title>🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.</title><link>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</link><pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</guid><description>I built an AI agent in self-hosted n8n over my kombucha-tracking app, then gave it two brains — NVIDIA&amp;rsquo;s 70B and a local Phi-3.5 — sharing the same tools. The cloud model called the tools and answered from real data. The local one couldn&amp;rsquo;t, so it made things up.</description><content:encoded><![CDATA[<h2 id="the-question">The question</h2>
<p>I run <a href="https://n8n.io">n8n</a> on my k3s homelab. Not docker-compose on a NUC — the full treatment: GitOps-reconciled, Vault-backed secrets, default-deny networking. The same boring platform everything else here runs on.</p>
<p>But &ldquo;I have n8n running&rdquo; proves nothing. I wanted to know if I actually understood it as an <em>agent platform</em>, and to answer a question I kept dodging: <strong>for agent work, do I need a cloud model, or is my local one good enough?</strong></p>
<p>So I built a real agent and gave it two brains.</p>
<h2 id="what-i-built">What I built</h2>
<p>A chat assistant over brew-buddy, my homemade kombucha-tracking app (React + a small API + Postgres). You ask it things in plain language; it calls the app&rsquo;s API and answers. The twist: the same question runs through <strong>two agents in parallel</strong> — one backed by NVIDIA&rsquo;s hosted <strong>Llama-3.3-70B</strong>, one by a local <strong>Phi-3.5-mini</strong> on CPU — and the workflow prints both answers side by side.</p>
<pre tabindex="0"><code>Chat ──▶ Agent (cloud: NVIDIA 70B) ──┐   tools (shared):
     └─▶ Agent (local: Phi-3.5)   ──┤     • get_all_batches
                                    │     • get_batch_detail
                                    │     • brewing_statistics
            (Merge) ──▶ both replies, labeled     • add_batch_log   ⟵ write
                                                  • create_batch    ⟵ write
</code></pre><p>Both agents share the same read tools. The two <em>write</em> tools are wired to the cloud agent only — more on that below.</p>
<p><img alt="The kombucha agent in n8n: a chat trigger fans out to two AI Agent nodes (cloud and local), both wired to the same brew-buddy tools, then merged so the two answers print side by side." loading="lazy" src="/posts/n8n-agent-cloud-vs-local/n8n.png"></p>
<p>The nice part: I didn&rsquo;t write a line of glue. n8n&rsquo;s stock <strong>OpenAI Chat Model</strong> node talks to anything OpenAI-compatible if you override the credential&rsquo;s Base URL — so one node points at <code>https://integrate.api.nvidia.com/v1</code>, the other at <code>http://llama-server.&lt;ns&gt;.svc:8080/v1</code> for the local server. Same node, two endpoints.</p>
<h2 id="the-infra-that-keeps-it-honest">The infra that keeps it honest</h2>
<p>I won&rsquo;t re-explain the platform here — it&rsquo;s in earlier posts: <a href="/posts/homelab-gitops/">GitOps</a>, <a href="/posts/k8s-gitops-secrets/">Vault-backed secrets</a>, <a href="/posts/k8s-network-isolation/">default-deny networking</a>, <a href="/posts/homelab-dual-path-tls/">dual-path TLS ingress</a>. But building the agent made one of them <em>tangible</em>.</p>
<p>n8n is, by design, a thing that makes arbitrary HTTP calls on a schedule. That&rsquo;s exactly what you want behind a default-deny network policy. n8n couldn&rsquo;t reach the brew-buddy API at all until I declared it — one line:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># n8n&#39;s namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">allowEgressToNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">web-ai-engine, web-brew-buddy]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c">#                                          ^ added this for the agent</span><span class="w">
</span></span></span></code></pre></div><p>(plus a matching ingress-allow on brew-buddy&rsquo;s side). That&rsquo;s the posture working as intended: the blast radius of a workflow tool is whatever I&rsquo;ve explicitly granted, and not one namespace more. Adding a capability is a reviewable one-liner in Git; Argo reconciles it. No <code>kubectl</code>, no guessing what n8n can reach.</p>
<h2 id="the-ab-same-agent-same-tools-two-brains">The A/B: same agent, same tools, two brains</h2>
<p><strong>Plain &ldquo;hi&rdquo;.</strong> Cloud answers in ~0.5s. Local takes noticeably longer — because even for &ldquo;hi&rdquo;, the agent feeds the model the full system prompt <em>plus the JSON schemas for every tool</em>, and Phi-3.5 has to chew through all of it on CPU before it can say a word. So far, the boring expected result: local is slower.</p>
<p>Then I asked a real question, and the result flipped in a way I didn&rsquo;t expect.</p>
<p><strong>&ldquo;What batches do I have?&rdquo;</strong></p>
<p>Cloud (70B) called <code>get_all_batches</code>, got the real rows, and answered:</p>
<blockquote>
<p>You have two batches: 2026-04-09-A (cold-crash, 3L) and 2026-04-09-W (cold-crash, 3L).</p>
</blockquote>
<p>Local (Phi-3.5) <strong>never called the tool.</strong> It didn&rsquo;t seem to realise it <em>had</em> tools. Instead it confidently explained how <em>I</em> could go find the data myself:</p>
<blockquote>
<p>To list all batches: 1. Access the brew-buddy app. 2. Look for a button labeled &ldquo;List Batches&rdquo;… <code>def get_all_batches(): …</code> … Remember, I&rsquo;m unable to directly interact with apps or databases.</p>
</blockquote>
<p>Fake instructions. Fake code. A polite apology. Everything except the actual answer it was sitting on top of.</p>
<p><strong>Writing data.</strong> I asked both to <em>log</em> an observation. Cloud called <code>add_batch_log</code> and wrote a real row to Postgres (&ldquo;I have recorded the observation…&rdquo;). Local bluffed again — &ldquo;here&rsquo;s how <em>you</em> can log it yourself.&rdquo;</p>
<h2 id="why-it-matters-capability-not-latency">Why it matters: capability, not latency</h2>
<p>The interesting finding isn&rsquo;t &ldquo;the big model is better.&rdquo; It&rsquo;s <em>how</em> the small one fails.</p>
<p>With a ~3.8B model on CPU, the bottleneck for agent work isn&rsquo;t speed — it&rsquo;s <strong>capability</strong>. Phi-3.5 couldn&rsquo;t reliably emit tool calls, so n8n&rsquo;s tools never fired, and the model degraded into a chatbot that <strong>hallucinates a plausible answer instead of fetching the real one.</strong> That failure mode is worse than an error: an error you catch, a confident wrong answer you ship.</p>
<p>A couple of measurements that sharpened it:</p>
<ul>
<li>NVIDIA 70B, <strong>plain chat</strong>: ~0.5s.</li>
<li>NVIDIA 70B, <strong>function-calling</strong> (with tool schemas): ~8.6s per round-trip — and an agent makes several round-trips per answer. That&rsquo;s real latency you have to budget a timeout for. (It&rsquo;s also why the cloud side initially <em>timed out</em> in n8n until I raised the model node&rsquo;s timeout — the model was fine, n8n was cutting it off.)</li>
</ul>
<p>So the snappy-vs-slow comparison <strong>flips depending on whether the question triggers tools</strong>. Plain chat: cloud wins on speed. Tool use: the local model is &ldquo;fast&rdquo; only because it skips the tools and makes something up. Speed was never the real axis.</p>
<p>The honest caveat: this is <em>this</em> small general model in a multi-tool agent loop. Purpose-built small models with tool-calling fine-tunes do better at narrow tasks — I run a 1.7B one elsewhere that emits a single structured tool call just fine. But for &ldquo;pick the right tool from several and chain them,&rdquo; 70B was in a different league.</p>
<h2 id="the-trust-boundary">The trust boundary</h2>
<p>I gave the write tools (<code>add_batch_log</code>, <code>create_batch</code>) to the cloud agent <strong>only</strong>. The local agent is read-only — not by instruction, by wiring. Even if Phi-3.5 <em>did</em> decide to call a write tool, the connection isn&rsquo;t there. The reliable model is the only one allowed to mutate real data, and that&rsquo;s enforced structurally, not by trusting a prompt.</p>
<h2 id="whats-toy-and-whats-real">What&rsquo;s toy and what&rsquo;s real</h2>
<p>Worth being straight: this is a <strong>single-node homelab</strong>. The agent and both model paths share one box. Running n8n on Kubernetes and swapping models isn&rsquo;t novel — <a href="https://docs.n8n.io/hosting/scaling/queue-mode/">n8n&rsquo;s own docs</a> cover queue mode, where a main instance fans work out to a pool of worker pods you scale horizontally, with external Postgres for state. That&rsquo;s the real production shape. Mine is one replica with an emptyDir&rsquo;s worth of ambition.</p>
<p>What I think <em>is</em> worth sharing is the finding (the capability cliff, and that its failure mode is confident fabrication) and the boring thing underneath it: because the platform is default-deny and GitOps-reconciled, running this experiment cost me one reviewable egress line and zero risk to anything else.</p>
<h2 id="the-boring-part-is-the-point">The boring part is the point</h2>
<p>The AI was the fun bit. But the reason I could bolt an agent onto a live cluster, point it at a real app, give it write access to one model and not the other, and tear it all down again — without worrying what it might touch — is that the infrastructure was already boring. Default-deny. Secrets out of Git. <code>git push</code>, Argo reconciles.</p>
<p>The model picks the tools. The platform decides what the tools can reach. Keep those two honest about each other and self-hosting an agent stops being scary and starts being just another app.</p>
]]></content:encoded></item><item><title>🔒 Building a PII Guardrail Proxy for Cloud LLM Calls</title><link>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</link><pubDate>Fri, 26 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</guid><description>A local model classifies every prompt before it leaves the cluster. If it&amp;rsquo;s sensitive, it&amp;rsquo;s blocked. If it&amp;rsquo;s clean, it goes to NVIDIA NIM. 150 lines of FastAPI, deployed on k3s.</description><content:encoded><![CDATA[<h2 id="the-problem-with-cloud-llm-access">The problem with cloud LLM access</h2>
<p>Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.</p>
<p>The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone&rsquo;s personal data into the chat? It leaves the cluster. It&rsquo;s logged somewhere you don&rsquo;t control.</p>
<p>The standard answer is &ldquo;train your users.&rdquo; I&rsquo;d rather have a technical control.</p>
<h2 id="the-architecture">The architecture</h2>
<pre tabindex="0"><code>Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain
</code></pre><p>Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:</p>
<ul>
<li><code>SAFE</code> → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)</li>
<li><code>SENSITIVE: &lt;reason&gt;</code> → return HTTP 400, log the block, nothing leaves the cluster</li>
</ul>
<p>The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The proxy is ~150 lines of FastAPI. The classifier call:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">CLASSIFIER_PROMPT</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;You are a data security classifier. Check if the text below contains sensitive information:
</span></span></span><span class="line"><span class="cl"><span class="s2">passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Reply with ONLY one of:
</span></span></span><span class="line"><span class="cl"><span class="s2">SAFE
</span></span></span><span class="line"><span class="cl"><span class="s2">SENSITIVE: &lt;one-line reason&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Text to check:
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">LLAMA_BASE</span><span class="si">}</span><span class="s2">/chat/completions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;phi-3.5-mini&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">CLASSIFIER_PROMPT</span> <span class="o">+</span> <span class="n">text</span><span class="p">[:</span><span class="mi">3000</span><span class="p">]}],</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;max_tokens&#34;</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Authorization&#34;</span><span class="p">:</span> <span class="s2">&#34;Bearer sk-no-key&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">answer</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">answer</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&#34;SENSITIVE&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">reason</span> <span class="o">=</span> <span class="n">answer</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;:&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="s2">&#34;:&#34;</span> <span class="ow">in</span> <span class="n">answer</span> <span class="k">else</span> <span class="s2">&#34;sensitive content detected&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span><span class="p">,</span> <span class="n">reason</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span><span class="p">,</span> <span class="s2">&#34;&#34;</span>
</span></span></code></pre></div><p><code>temperature=0</code> and <code>max_tokens=30</code> keep the response deterministic and fast. The model only needs to output one word or one line.</p>
<p>The main handler:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@app.post</span><span class="p">(</span><span class="s2">&#34;/v1/chat/completions&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">proxy_chat</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span> <span class="o">=</span> <span class="k">await</span> <span class="n">request</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">user_text</span> <span class="o">=</span> <span class="n">extract_user_text</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;messages&#34;</span><span class="p">,</span> <span class="p">[]))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">user_text</span><span class="o">.</span><span class="n">strip</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span><span class="p">,</span> <span class="n">reason</span> <span class="o">=</span> <span class="k">await</span> <span class="n">classify</span><span class="p">(</span><span class="n">user_text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">&#34;classifier error: </span><span class="si">%s</span><span class="s2"> — allowing request through&#34;</span><span class="p">,</span> <span class="n">exc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">is_sensitive</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;error&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;message&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;Request blocked by ai-guard: </span><span class="si">{</span><span class="n">reason</span><span class="si">}</span><span class="s2">. Remove sensitive content before sending to external models.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;content_policy_violation&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Safe — forward to upstream with streaming support</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span></code></pre></div><p>Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I&rsquo;d rather not block all cloud LLM access because the local model is warming up.</p>
<h2 id="kubernetes-deployment">Kubernetes deployment</h2>
<p>ai-guard runs in the same namespace as llama-server and Open WebUI (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed in Cilium, so no new network policy needed.</p>
<p>Open WebUI uses semicolon-separated lists for multiple API backends:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_BASE_URLS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_KEYS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;sk-no-key;sk-no-key&#34;</span><span class="w">
</span></span></span></code></pre></div><p>The second entry is ai-guard. Open WebUI passes <code>sk-no-key</code> as the API key — ai-guard ignores it and uses its own <code>UPSTREAM_API_KEY</code> from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.</p>
<h2 id="the-latency-tradeoff">The latency tradeoff</h2>
<p>The classification step adds 5–15 seconds on CPU inference. That&rsquo;s the cost of keeping the check fully private — the classifier never sends data anywhere.</p>
<p>For a personal homelab assistant, this is fine. For a high-throughput production setup, you&rsquo;d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.</p>
<h2 id="what-it-catches">What it catches</h2>
<p>The classifier prompt targets:</p>
<ul>
<li>Passwords, API keys, tokens, credentials</li>
<li>PII: names, emails, phone numbers, SSNs, addresses</li>
<li>Financial data: card numbers, bank accounts</li>
<li>Private keys</li>
</ul>
<p>False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/ai-guard">github.com/janos-gyorgy/ai-guard</a> — MIT licensed, Kubernetes manifests included.</p>
]]></content:encoded></item><item><title>🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send</title><link>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</guid><description>Replace PII with semantically realistic fakes before sending to a cloud LLM, then restore the originals from the response. Started with a general model and prompt engineering — then upgraded to a purpose-built 1.7B fine-tune via Ollama.</description><content:encoded><![CDATA[<h2 id="the-problem-with-blocking">The problem with blocking</h2>
<p>The <a href="/posts/ai-pii-guardrail-proxy/">PII guardrail proxy I built last week</a> works by classifying prompts and blocking the sensitive ones. That&rsquo;s fine for a chat interface where a human can rephrase. It doesn&rsquo;t work for automated pipelines.</p>
<p>If a Jira ticket contains someone&rsquo;s name and an internal hostname, you don&rsquo;t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.</p>
<h2 id="the-pattern">The pattern</h2>
<pre tabindex="0"><code>Input text
  → anonymizer: extract PII, replace with semantic fakes
  → &#34;Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net&#34;
  + mapping: {&#34;Nathan Chen&#34; → &#34;John Smith&#34;, &#34;DataSoft LLC&#34; → &#34;ACME&#34;, ...}
  → cloud LLM: processes coherent text, never sees real values
  → &#34;Nathan Chen should check the ProjectX docs with the DataSoft LLC team&#34;
  → string substitution with reverse mapping
  → &#34;John Smith should check the OAuth docs with the ACME team&#34;
</code></pre><p>Two things that make this work:</p>
<p><strong>Deanonymization needs no LLM.</strong> Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.</p>
<p><strong>Semantic fakes beat placeholder tokens.</strong> An earlier version of this used <code>[PERSON_1]</code>, <code>[ORG_1]</code> tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees <code>Nathan Chen from DataSoft LLC</code>, it treats it as real text and responds naturally. Quality is noticeably better.</p>
<h2 id="prior-art--what-already-exists">Prior art — what already exists</h2>
<p>This is a well-established pattern. Worth knowing what&rsquo;s out there:</p>
<p><strong><a href="https://llm-guard.com/output_scanners/deanonymize/">LLM Guard</a></strong> (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you&rsquo;re building this for anything serious.</p>
<p><strong><a href="https://techcommunity.microsoft.com/blog/azuredevcommunityblog/introducing-pii-shield-a-privacy-proxy-for-every-llm-call/4514726">Microsoft PII Shield</a></strong> — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.</p>
<p><strong><a href="https://github.com/fsndzomga/anonLLM">anonLLM</a></strong> — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.</p>
<p><strong><a href="https://ieeexplore.ieee.org/document/11140717/">REDACT</a></strong> — IEEE paper describing a system using Ollama for PII redaction in documents.</p>
<p><strong><a href="https://huggingface.co/blog/pratyushrt/anonymizerslm">HuggingFace Anonymizer SLM series</a></strong> — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1&rsquo;s 9.77.</p>
<p>That last one is what this implementation actually uses.</p>
<h2 id="the-model-anonymizer-17b">The model: Anonymizer-1.7B</h2>
<p><a href="https://huggingface.co/eternisai/Anonymizer-1.7B">eternisai/Anonymizer-1.7B</a> is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;arguments&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;John Smith&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;Nathan Chen&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;ACME Corp&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;DataSoft LLC&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;auth.acme.internal&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;dev.internal.net&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>No prompt engineering needed. The model knows exactly what it&rsquo;s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.</p>
<p>The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: <code>hf.co/gabriellarson/Anonymizer-1.7B-GGUF</code>.</p>
<h2 id="the-implementation">The implementation</h2>
<p><code>llm-anonymizer</code> is a FastAPI service with two endpoints.</p>
<p><strong><code>POST /anonymize</code></strong> — calls Ollama with the tool definition, parses the response:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">TOOLS</span> <span class="o">=</span> <span class="p">[{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Replace PII entities with anonymized versions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;array&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;items&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;original&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;replacement&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">,</span> <span class="s2">&#34;replacement&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                    <span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;replacements&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl"><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">OLLAMA_BASE</span><span class="si">}</span><span class="s2">/api/chat&#34;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">text</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">/no_think&#34;</span><span class="p">},</span>  <span class="c1"># skip Qwen3 thinking mode</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;tools&#34;</span><span class="p">:</span> <span class="n">TOOLS</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tool_calls</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;tool_calls&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">replacements</span> <span class="o">=</span> <span class="n">tool_calls</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;function&#34;</span><span class="p">][</span><span class="s2">&#34;arguments&#34;</span><span class="p">][</span><span class="s2">&#34;replacements&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build reverse mapping: replacement → original (for deanonymization)</span>
</span></span><span class="line"><span class="cl"><span class="n">anonymized</span> <span class="o">=</span> <span class="n">text</span>
</span></span><span class="line"><span class="cl"><span class="n">mapping</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">replacements</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">anonymized</span> <span class="o">=</span> <span class="n">anonymized</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">],</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">mapping</span><span class="p">[</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">]]</span> <span class="o">=</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">]</span>
</span></span></code></pre></div><p>The <code>/no_think</code> suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.</p>
<p><strong><code>POST /deanonymize</code></strong> — no model call, just substitution:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">replacement</span><span class="p">,</span> <span class="n">original</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">mapping</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">replacement</span><span class="p">,</span> <span class="n">original</span><span class="p">)</span>
</span></span></code></pre></div><p>Sorted by length descending so longer tokens don&rsquo;t get partially overwritten by shorter ones.</p>
<h2 id="the-kubernetes-stack">The Kubernetes stack</h2>
<p>Ollama runs as a separate deployment in the same namespace as everything else (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed — no new network policies.</p>
<pre tabindex="0"><code>llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF
</code></pre><p>One-time model pull after first deploy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl <span class="nb">exec</span> -n web-ai-engine deploy/ollama -- <span class="se">\
</span></span></span><span class="line"><span class="cl">  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF
</span></span></code></pre></div><p>Ollama caches it on a 10Gi PVC, so pod restarts don&rsquo;t re-download.</p>
<h2 id="the-n8n-pipeline">The n8n pipeline</h2>
<p>Five-node chain triggered by webhook:</p>
<pre tabindex="0"><code>Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond
</code></pre><p>The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.</p>
<p>Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.</p>
<h2 id="the-caveats">The caveats</h2>
<p><strong>1.7B isn&rsquo;t GPT-4.1.</strong> The model scores 9.20/10 on the benchmark — which means roughly 1 in 10 cases has a missed or incorrect entity. Test with real examples from your domain before depending on it.</p>
<p><strong>Deanonymization breaks on heavy rephrasing.</strong> If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn&rsquo;t eliminate the risk.</p>
<p><strong>Ollama adds a deployment.</strong> It&rsquo;s ~500MB image + the model weights (~1GB Q4). On a constrained single-node cluster that&rsquo;s real overhead. llama-server already covers general chat; Ollama is purely for this model&rsquo;s tool-calling support.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/llm-anonymizer">github.com/janos-gyorgy/llm-anonymizer</a> — MIT licensed, Kubernetes manifests and n8n workflow included.</p>
]]></content:encoded></item><item><title>📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics</title><link>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</guid><description>llama.cpp&amp;rsquo;s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.</description><content:encoded><![CDATA[<h2 id="what-operating-an-llm-actually-means">What &ldquo;operating an LLM&rdquo; actually means</h2>
<p>Running a local model is easy. Understanding what it&rsquo;s doing is less so.</p>
<p>After deploying llama.cpp + Open WebUI on k3s (<a href="/posts/local-llm-k8s-no-gpu/">previous post</a>), I had a chat interface backed by a local model. What I didn&rsquo;t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.</p>
<p>The instinct for this kind of problem is usually &ldquo;add a proxy layer.&rdquo; There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.</p>
<p>The thing I&rsquo;d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.</p>
<hr>
<h2 id="--metrics"><code>--metrics</code></h2>
<p>One additional argument to the inference server:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- -<span class="l">m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">/models/Phi-3.5-mini-instruct-Q4_K_M.gguf</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">host</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;0.0.0.0&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">port</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;8080&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">ctx-size</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;4096&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="kc">n</span>-<span class="l">predict</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1024&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">parallel</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">metrics       </span><span class="w"> </span><span class="c"># ← this</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">log-disable</span><span class="w">
</span></span></span></code></pre></div><p>After restart, <code>GET /metrics</code> on port 8080 returns valid Prometheus exposition format:</p>
<pre tabindex="0"><code># HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
</code></pre><p>The full set of metrics:</p>
<table>
	<thead>
			<tr>
					<th>Metric</th>
					<th>Type</th>
					<th>What it measures</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>llamacpp:prompt_tokens_total</code></td>
					<td>counter</td>
					<td>Input tokens processed (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_total</code></td>
					<td>counter</td>
					<td>Output tokens generated (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current prompt throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:predicted_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current generation throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent generating</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent on prompts</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_processing</code></td>
					<td>gauge</td>
					<td>Requests currently being processed</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_deferred</code></td>
					<td>gauge</td>
					<td>Requests queued, waiting for a slot</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_decode_total</code></td>
					<td>counter</td>
					<td>Total llama_decode() calls</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_busy_slots_per_decode</code></td>
					<td>counter</td>
					<td>Slots active per decode call</td>
			</tr>
	</tbody>
</table>
<p>These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.</p>
<hr>
<h2 id="prometheus-scrape-config">Prometheus scrape config</h2>
<p>Adding a static scrape target in the existing Prometheus configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">extraScrapeConfigs</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">  - job_name: llama-server
</span></span></span><span class="line"><span class="cl"><span class="sd">    static_configs:
</span></span></span><span class="line"><span class="cl"><span class="sd">      - targets:
</span></span></span><span class="line"><span class="cl"><span class="sd">          - llama-server.web-ai-engine.svc:8080
</span></span></span><span class="line"><span class="cl"><span class="sd">    metrics_path: /metrics</span><span class="w">
</span></span></span></code></pre></div><p>The only non-obvious thing here is the network policy: Prometheus lives in <code>dashboard-homelab</code>, and llama-server lives in <code>web-ai-engine</code>. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In <code>applications.yml</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">allowIngressFromNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">dashboard-homelab]</span><span class="w">
</span></span></span></code></pre></div><p>Without this, Prometheus scrape attempts fail silently with a timeout.</p>
<hr>
<h2 id="grafana-dashboard-via-configmap">Grafana dashboard via ConfigMap</h2>
<p>Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label <code>grafana_dashboard: &quot;1&quot;</code> is picked up, loaded, and available in Grafana — across all namespaces by default.</p>
<p>The dashboard ConfigMap lives in <code>web-ai-engine</code>, not <code>dashboard-homelab</code>. The sidecar finds it regardless:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ConfigMap</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">grafana-dashboard-llm</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">grafana_dashboard</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">llm-metrics.json</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">    {
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;title&#34;: &#34;LLM Metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;uid&#34;: &#34;llm-metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      ...
</span></span></span><span class="line"><span class="cl"><span class="sd">    }</span><span class="w">
</span></span></span></code></pre></div><p>Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.</p>
<p>This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app&rsquo;s Kubernetes resources also describes what the monitoring looks like.</p>
<hr>
<h2 id="what-the-dashboard-shows">What the dashboard shows</h2>
<p>After sending a few messages through Open WebUI:</p>
<p><strong>Generation throughput</strong> — the <code>llamacpp:predicted_tokens_seconds</code> gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you&rsquo;re comparing models or quantisation levels.</p>
<p><strong>Cumulative tokens</strong> — <code>llamacpp:prompt_tokens_total</code> and <code>llamacpp:tokens_predicted_total</code> both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it&rsquo;s typically 3:1 prompt to generation; for summarisation tasks it flips.</p>
<p><strong>Queue depth</strong> — <code>llamacpp:requests_deferred</code> is 0 almost always, which is expected with <code>--parallel 1</code>. If it&rsquo;s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.</p>
<p><strong>ms/token</strong> — derived from <code>rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000</code>. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.</p>
<hr>
<h2 id="whats-missing-compared-to-a-proxy-layer">What&rsquo;s missing compared to a proxy layer</h2>
<p>LiteLLM and similar proxies give you things this setup doesn&rsquo;t:</p>
<ul>
<li><strong>Per-model routing</strong> — if you&rsquo;re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.</li>
<li><strong>Virtual API keys</strong> — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.</li>
<li><strong>Spend tracking</strong> — meaningful when you&rsquo;re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.</li>
</ul>
<p>For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.</p>
<hr>
<h2 id="the-pattern">The pattern</h2>
<p>The broader point is that the observable unit here isn&rsquo;t the proxy — it&rsquo;s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it&rsquo;s the right place to measure.</p>
<p>Starter manifests with the metrics configuration included: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
]]></content:encoded></item><item><title>🤖 Local LLM Inference on Kubernetes, No GPU Required</title><link>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</guid><description>A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed as a single Git push.</description><content:encoded><![CDATA[<h2 id="the-gpu-assumption">The GPU assumption</h2>
<p>Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you&rsquo;re wasting your time — inference will be too slow to be useful.</p>
<p>That&rsquo;s not been my experience.</p>
<p>I&rsquo;ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.</p>
<p>The point isn&rsquo;t that CPU inference beats GPU inference. It&rsquo;s that &ldquo;good enough for personal use&rdquo; is a much lower bar than &ldquo;production LLM serving&rdquo;, and the hardware you already have probably clears it.</p>
<hr>
<h2 id="the-stack">The stack</h2>
<p>Two components:</p>
<p><strong>llama.cpp</strong> (<code>ghcr.io/ggml-org/llama.cpp:server</code>) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.</p>
<p><strong>Open WebUI</strong> (<code>ghcr.io/open-webui/open-webui</code>) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.</p>
<p>The architecture is simple on purpose:</p>
<pre tabindex="0"><code>Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models
</code></pre><p>Open WebUI doesn&rsquo;t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn&rsquo;t change. The interface is the standard.</p>
<hr>
<h2 id="model-choice">Model choice</h2>
<p>GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:</p>
<table>
	<thead>
			<tr>
					<th>Model</th>
					<th>Quant</th>
					<th>Size</th>
					<th>RAM at runtime</th>
					<th>Notes</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Llama-3.2-3B</td>
					<td>Q4_K_M</td>
					<td>~2 GB</td>
					<td>~3 GB</td>
					<td>Fastest, lowest quality</td>
			</tr>
			<tr>
					<td>Phi-3.5-mini</td>
					<td>Q4_K_M</td>
					<td>~2.4 GB</td>
					<td>~3–4 GB</td>
					<td>Good balance — what I use</td>
			</tr>
			<tr>
					<td>Mistral-7B-Instruct</td>
					<td>Q4_K_M</td>
					<td>~4.1 GB</td>
					<td>~5–6 GB</td>
					<td>Noticeably better, needs more RAM</td>
			</tr>
			<tr>
					<td>Llama-3.1-8B</td>
					<td>Q4_K_M</td>
					<td>~4.7 GB</td>
					<td>~6–8 GB</td>
					<td>High quality, stretches 16 GB with other workloads</td>
			</tr>
	</tbody>
</table>
<p>On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it&rsquo;s tighter.</p>
<p>Models live in <code>/srv/ai-models</code> on the node, mounted into the pod as a <code>hostPath</code> volume. Single-node homelab, so there&rsquo;s no scheduling concern. Download once with <code>wget</code>, done.</p>
<hr>
<h2 id="key-configuration-choices">Key configuration choices</h2>
<p><strong>Context size (<code>--ctx-size 4096</code>):</strong> How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you&rsquo;re summarising long documents, bump to 8192 and watch your RAM usage.</p>
<p><strong>Max output tokens (<code>--n-predict 1024</code>):</strong> Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.</p>
<p><strong>Parallel slots (<code>--parallel 1</code>):</strong> How many concurrent inference requests the server handles. On CPU there&rsquo;s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.</p>
<p><strong>Memory limits:</strong> Set the container limit to roughly 2× the model&rsquo;s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1Gi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">6Gi</span><span class="w">
</span></span></span></code></pre></div><p>No CPU limit. llama-server will use however many cores are available during inference — that&rsquo;s what makes it usable. A CPU limit would throttle inference to unusable speeds.</p>
<hr>
<h2 id="deployment-as-a-gitops-push">Deployment as a GitOps push</h2>
<p>The whole stack lives in one YAML values file, deployed through the <a href="https://github.com/janos-gyorgy/gitops-extra-objects-chart">extra-objects chart</a> that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.</p>
<p>Nothing was <code>kubectl apply</code>-ed. The deployment happened by pushing to Git.</p>
<p>What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no <code>kubectl</code>. The same process I use for any other service in the cluster.</p>
<p>The namespace, network policies, service account, and RBAC all generate from a single entry in <code>applications.yml</code> — same as every other app. The AI inference stack isn&rsquo;t special from an operations perspective.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># applications.yml excerpt</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><hr>
<h2 id="access-and-auth">Access and auth</h2>
<p>The service is exposed at <code>ai.hippotion.com</code> through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let&rsquo;s Encrypt cert. See <a href="/posts/homelab-dual-path-tls/">that post</a> for the full explanation.</p>
<p>Auth is handled by Traefik&rsquo;s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI&rsquo;s own auth is disabled (<code>WEBUI_AUTH: false</code>) — the OAuth layer upstream handles it. One login covers every service in the cluster.</p>
<p>The <code>WEBUI_SECRET_KEY</code> (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.</p>
<hr>
<h2 id="what-the-day-to-day-is-actually-like">What the day-to-day is actually like</h2>
<p>Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you&rsquo;re reading what came before while it generates, that&rsquo;s fine. For quick factual lookups, it&rsquo;s a little tedious.</p>
<p>The useful cases for a local model, for me:</p>
<ul>
<li><strong>Rephrasing or editing text</strong> — paste something, ask it to tighten it. No data leaves the house.</li>
<li><strong>Config explanation</strong> — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.</li>
<li><strong>Quick summaries</strong> — short documents, log snippets, error messages.</li>
<li><strong>Experimentation</strong> — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.</li>
</ul>
<p>For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I&rsquo;m iterating and don&rsquo;t want to pay per token.</p>
<hr>
<h2 id="the-starting-point-if-you-want-to-try-it">The starting point if you want to try it</h2>
<p>The manifests are on GitHub: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
<p>It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.</p>
<p>No GPU required. The ThinkCentre in the corner of my desk does the job.</p>
]]></content:encoded></item></channel></rss>