<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Llm on hippotion</title><link>https://blog.hippotion.com/tags/llm/</link><description>Recent content in Llm on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 15 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>VoteWatch: How Your Representatives Voted — and Whether You'd Agree</title><link>https://blog.hippotion.com/posts/votewatch/</link><pubDate>Fri, 15 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/votewatch/</guid><description>Parliamentary roll-call votes are public, machine-readable, and almost completely unread. I built a thing that scrapes them, distills each decision into one plain-language question, shows which party voted which way, and lets you register whether you agree — then puts your answer next to how parliament actually voted. The rule that keeps it honest: the AI writes the summary, but it never decides a fact.</description><content:encoded><![CDATA[<h2 id="open-data-nobody-opens">Open data nobody opens</h2>
<p>Every vote in the European Parliament and the Slovak National Council is
public. The EU even ships it as a clean API. And almost nobody reads it,
because the raw record is unreadable: <em>&ldquo;Návrh poslanca… ktorým sa dopĺňa zákon
č. 581/2004 Z. z. … (tlač 1259) — tretie čítanie, hlasovanie o návrhu zákona
ako o celku.&rdquo;</em> Multiply that by a few hundred votes a sitting. Transparency
that no human can parse is transparency on paper only.</p>
<p>So I built <strong>VoteWatch</strong> — a small site on my homelab that turns the record
into something a citizen can actually use: <em>what was decided, who voted, and
do you agree?</em></p>
<figure>
    <img loading="lazy" src="sk-overview.png"
         alt="VoteWatch SK in plain-language mode"/> <figcaption>
            <p>VoteWatch SK: each decision summarised in plain language, which parties voted how, and a Yes/No question whose live citizen tally sits next to how parliament actually voted — labelled <em>agree</em> or <em>gap</em>.</p>
        </figcaption>
</figure>

<h2 id="two-halves-one-lopsided">Two halves, one lopsided</h2>
<p>The EU half was easy. <a href="https://howtheyvote.eu">HowTheyVote.eu</a> already did the
hard work and publishes roll-call votes as a clean, open-licensed API. You
consume it; you don&rsquo;t scrape it.</p>
<p>The Slovak half is where the real work lives — and the real value. <code>nrsr.sk</code>
has <strong>no API</strong>. The HTML is the contract: a results listing, and per-vote
pages where each MP appears next to a one-letter code (<code>[Z]</code> za, <code>[P]</code> proti,
<code>[?]</code> zdržal sa). So the national half is a genuine scraper — the unglamorous
kind that nobody maintains, which is exactly why a gap exists to fill. The
unglamorous part <em>is</em> the moat.</p>
<h2 id="from-ten-votes-to-one-question">From ten votes to one question</h2>
<p>A single bill generates a pile of procedural roll-calls — shorten the debate,
move to third reading, amendment block A, amendment block B, the bill as a
whole. Ten rows that are really one decision. Nobody wants ten rows.</p>
<p>So the pipeline groups votes by bill, then asks an LLM (llama-3.3-70b on
NVIDIA NIM) to do exactly one job: turn the bureaucratic titles into a plain
headline, two sentences of summary, and <strong>one neutral Yes/No question</strong> a
person can actually answer. Seven votes on the health-insurer bill collapse
into: <em>&ldquo;Changes to the health-insurance law&rdquo;</em> → <em>&ldquo;Do you agree with the
health-insurance bill?&rdquo;</em></p>
<h2 id="the-rule-that-keeps-it-honest">The rule that keeps it honest</h2>
<p>Here&rsquo;s the line I won&rsquo;t cross, and it&rsquo;s the whole reason I trust the result:
<strong>the AI writes the prose, but it never decides a fact.</strong></p>
<ul>
<li>Which votes belong to one bill? Deterministic — parsed from the bill number.</li>
<li>Did it pass? Deterministic — read from the result row.</li>
<li>Which parties voted for, against, abstained? Deterministic — tallied from
the per-MP record, shown as <em>Za: SMER-SD, HLAS-SD, SNS · Zdržali sa: PS, KDH,
SaS</em>.</li>
</ul>
<p>The model only touches language: the headline, the summary, the question. If
it hallucinates, you get an awkward sentence — never a wrong vote count. And
if the model fails entirely, the card falls back to the raw title. The facts
come from the record; the model just makes the record legible. For civic data,
that separation isn&rsquo;t a nice-to-have — it&rsquo;s the difference between a tool and a
liability. (Every card says so out loud: <em>summaries are AI-generated; the raw
record prevails.</em>)</p>
<h2 id="the-part-that-closes-the-loop">The part that closes the loop</h2>
<p>Showing people how their representatives voted is only half a feedback loop.
The other half is letting them answer.</p>
<p>Each decision carries its one distilled question and two buttons — <strong>Áno / Nie</strong>.
You vote, and the site shows the citizen tally <em>next to</em> how parliament
actually decided, with the honest verdict on top: <em>&quot;✓ Citizens and Parliament
agree&quot;</em> or <em>&quot;⚖ Gap between citizens and Parliament.&quot;</em> That gap is the entire
point. It&rsquo;s the thesis behind a side project of mine called
<a href="https://veracracy.hippotion.com">veracracy</a> — governance measured against
verified knowledge and the actual will of the governed — made concrete enough
to click.</p>
<figure>
    <img loading="lazy" src="eu-overview.png"
         alt="VoteWatch EU overview mode"/> <figcaption>
            <p>The same loop on the European Parliament — dossiers consolidated, political-group stances (EPP, S&amp;D, PfE…), and the citizen poll under each topic.</p>
        </figcaption>
</figure>

<p>The backend is deliberately boring. The site is static (git-synced nginx,
same as this blog). Votes can&rsquo;t POST to a static page, so they go to a public
<a href="https://n8n.hippotion.com">n8n</a> webhook that records to a data table and
returns live tallies — no new service, no database, just the automation box I
already run. Vote keys are namespaced so EU and Slovak polls share one store
without colliding.</p>
<h2 id="the-honest-caveat">The honest caveat</h2>
<p>Dedup is browser-local. It stops casual double-voting, but behind a Cloudflare
tunnel every request shares one IP, so this is an <strong>indicative signal, not a
secured ballot</strong>. That&rsquo;s the right altitude for &ldquo;let people express an
opinion.&rdquo; The day it needs to mean more than that, it needs real identity
first — and I&rsquo;d rather ship the honest version than fake the robust one.</p>
<p>It&rsquo;s live at <a href="https://votewatch.hippotion.com">votewatch.hippotion.com</a> — the
EU parliament and the Slovak NR SR, every MEP and every poslanec, in plain
language, with a button that asks the only question that matters after a vote:
<strong>would you have voted the same way?</strong></p>
<p>A neutral record — what was decided and who decided it — not a villain list.
Data © <a href="https://howtheyvote.eu">HowTheyVote.eu</a> (ODbL) and <code>nrsr.sk</code>.</p>
]]></content:encoded></item><item><title>Mind the gap: I pointed monitoring at my own skill set</title><link>https://blog.hippotion.com/posts/mind-the-gap-skill-radar/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/mind-the-gap-skill-radar/</guid><description>A rejection isn&amp;rsquo;t actionable data. So an n8n workflow now extracts skill demand from live job listings, diffs it against what I can prove, and renders the gap as a dashboard — deployed like everything else here: via git push.</description><content:encoded><![CDATA[<p>A while back I applied for a senior platform role at n8n and didn&rsquo;t land it. Fair enough — but
&ldquo;fair enough&rdquo; isn&rsquo;t actionable. Rejections come with no logs, no metrics, no trace. For someone
who runs thirty-odd services with full observability, having <em>vibes</em> as the only instrumentation
on my own career felt architecturally embarrassing.</p>
<p>So I built <strong>mind-the-gap</strong>: a pipeline that measures what the market demands, diffs it against
what I can prove, and renders the gap as a private dashboard on my cluster. The job hunt is now a
monitored system. This post is about the non-obvious decisions.</p>
<h2 id="demand-an-llm-reads-job-listings-so-i-dont-have-to">Demand: an LLM reads job listings so I don&rsquo;t have to</h2>
<p>I already had <a href="/posts/ats-job-poller/">a job poller</a> — an n8n workflow that polls the public ATS
APIs (Greenhouse / Lever / Ashby) of ~33 companies plus a broad remote-jobs feed every six hours.
A sibling workflow now re-fetches the same boards and, for every listing that passes the
role+location gate, asks a small hosted LLM (Llama-3.1-8B) for a structured extraction:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;seniority&#34;</span><span class="p">:</span> <span class="s2">&#34;senior&#34;</span><span class="p">,</span> <span class="nt">&#34;skills&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;kubernetes&#34;</span><span class="p">,</span> <span class="nt">&#34;importance&#34;</span><span class="p">:</span> <span class="s2">&#34;must&#34;</span><span class="p">},</span> <span class="err">...</span><span class="p">]}</span>
</span></span></code></pre></div><p>One row per <em>(job, skill)</em> lands in an n8n Data Table. Decisions that mattered:</p>
<ul>
<li><strong>One LLM call per job, not one batch.</strong> Free-tier inference times out on batches; per-job calls
are slower but fail independently. A lesson the poller already paid for.</li>
<li><strong>Insert doubles as the processed-marker.</strong> A job whose extraction fails to parse produces no
rows — so it&rsquo;s retried next run, for free. No status column, no second table.</li>
<li><strong>Canonicalization in code, not in the prompt.</strong> The model says &ldquo;K8s&rdquo;, &ldquo;k3s&rdquo;, &ldquo;EKS&rdquo; on
different days regardless of instructions. A dumb alias map (<code>k8s→kubernetes</code>, <code>eks→aws</code>)
beats prompt engineering for consistency.</li>
<li><strong>8B is good enough — with a guard.</strong> It occasionally echoed the seniority enum back literally
(<code>&quot;junior|mid|senior|staff|lead|unspecified&quot;</code>). The fix is one line of validation, not a bigger
model.</li>
</ul>
<h2 id="supply-no-artifact-no-credit">Supply: no artifact, no credit</h2>
<p>The other side of the diff is a skills registry — markdown in my knowledge vault, with a
machine-parseable YAML block. Every skill has a state, and the rule that keeps the whole thing
honest is brutal: <strong>a skill counts as <code>proven</code> only if an artifact exists</strong> — a public repo, a
blog post, documented production experience. Otherwise it&rsquo;s <code>claimed</code>, and claimed earns half
credit.</p>
<p>That rule immediately produced the most useful insight of the project: <strong>&ldquo;invisible skill&rdquo; is a
real category.</strong> Python turned out to be the market&rsquo;s #5 ask. I use it constantly — and could
point to nothing public that shows it. The cheapest score increase isn&rsquo;t learning something new;
it&rsquo;s a weekend making an existing skill visible. No gut-feeling gap analysis would have ranked
&ldquo;write about what you already do&rdquo; above &ldquo;learn the shiny thing.&rdquo;</p>
<h2 id="the-score-distinct-companies-not-mentions">The score: distinct companies, not mentions</h2>
<p>First naive aggregation: Canonical&rsquo;s listings mention Ubuntu <em>nine times, all marked must-have</em> —
suddenly Ubuntu looks like the hottest skill in Europe. Employer skew is the noise floor of small
samples. The fix: demand weight = <strong>distinct companies naming the skill</strong>, not total mentions.
One enthusiastic employer can&rsquo;t move the radar.</p>
<p>Two more scoring rules I&rsquo;d defend in review:</p>
<ul>
<li>Skills named by fewer than two companies don&rsquo;t count at all — single-listing noise stays out.</li>
<li>Demand the registry hasn&rsquo;t classified yet shows up as &ldquo;unreviewed&rdquo; and <strong>counts fully against
the score</strong>. An unreviewed market signal is a gap until proven otherwise; the dashboard nags me
to triage it.</li>
</ul>
<h2 id="rendering-the-page-is-a-git-commit">Rendering: the page is a git commit</h2>
<p>The dashboard is a single static HTML file, and the pipeline that produces it never touches the
cluster. <code>render.js</code> lives in this repo as the single source of truth; a nightly n8n workflow
fetches it raw from GitLab, <code>eval()</code>s it against the Data Table rows and the registry, and — only
if the result differs from what&rsquo;s committed (timestamps stripped, or every night is a &ldquo;change&rdquo;) —
PUTs the new <code>index.html</code> back via the GitLab API.</p>
<p>Serving is the same pattern as this blog: nginx plus a git-pull sidecar, deployed by Argo CD,
behind the cluster&rsquo;s OAuth middleware. The renderer has no kubeconfig, no SSH, no cluster access
of any kind. <strong>GitLab stays the only source of truth — even for a page that rewrites itself
nightly.</strong> If the workflow goes rogue, the worst it can do is a reviewable commit.</p>
<h2 id="day-one-verdict">Day-one verdict</h2>
<p>First run: 2,297 postings fetched, 25 in scope, 257 skill rows. Coverage score: <strong>63%</strong>.
Kubernetes and AWS tied at the top of demand — which means the AWS gap-closing project already in
flight stopped being a hunch and became the measured top of the market. Go is the only top-ten
demand with zero supply. The dashboard doesn&rsquo;t get anyone a job; it just makes sure every learning
Saturday is pointed where the data says, not where the hype does.</p>
<p>The job board rejected me. The data didn&rsquo;t.</p>
<hr>
<p><em>Workflows, render.js, and setup: <a href="https://github.com/janos-gyorgy/mind-the-gap">github.com/janos-gyorgy/mind-the-gap</a>.</em></p>
]]></content:encoded></item><item><title>🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n</title><link>https://blog.hippotion.com/posts/ats-job-poller/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ats-job-poller/</guid><description>You don&amp;rsquo;t have to be job-hunting to want to know your market — what&amp;rsquo;s out there, what it pays, where you&amp;rsquo;d fit. So I built an n8n workflow: it polls the public ATS APIs (Greenhouse/Lever/Ashby) plus a broad remote-jobs feed, filters for remote-EU infra roles, scores each posting against my CV with an LLM, and emails me only the 80%+ matches. No database, no scraping.</description><content:encoded><![CDATA[<p>You don&rsquo;t have to be about to change jobs to want to know the landscape. What&rsquo;s being built, what it
pays, where you&rsquo;d actually fit — staying current on the market (and your own worth) is just good
professional hygiene. The trouble is that <em>checking</em> is tedious, so most of us don&rsquo;t, until we&rsquo;re
already job-hunting and starting cold.</p>
<p>So I automated mine. An <a href="https://n8n.io">n8n</a> workflow on my homelab polls job boards every six hours,
scores each new posting against my profile with an LLM, and emails me only the strong matches — the
ones scoring 80%+. When it&rsquo;s quiet, it&rsquo;s silent. When something genuinely fits, I know the same day.
Here&rsquo;s what I learned building it. Repo at the bottom.</p>
<h2 id="three-apis-cover-most-of-the-market">Three APIs cover most of the market</h2>
<p>Company career pages look bespoke, but underneath, the vast majority run on one of three ATS — and
all three hand you the jobs as unauthenticated JSON:</p>
<ul>
<li><strong>Greenhouse</strong> — <code>boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true</code></li>
<li><strong>Lever</strong> — <code>api.lever.co/v0/postings/{token}?mode=json</code></li>
<li><strong>Ashby</strong> — <code>api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true</code></li>
</ul>
<p>No scraping, no headless browser. You poll the API the page itself calls, normalize the three
shapes into one <code>{ company, title, location, remote, url, posted_at, description, external_id }</code>, and
you&rsquo;re done with the hard part.</p>
<h2 id="resolve-the-token-is-half-the-battle">&ldquo;Resolve the token&rdquo; is half the battle</h2>
<p>The naive assumption — <em>the token is the company name, and everyone&rsquo;s on one of the three</em> — is half
right. When I probed my initial wishlist, <strong>roughly half 404&rsquo;d everywhere</strong>: HashiCorp (now under
IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They&rsquo;re on a fourth or fifth
system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as
disabled config stubs. Verify before you trust a slug.</p>
<h2 id="dedup-without-a-database">Dedup without a database</h2>
<p>I didn&rsquo;t want to stand up Postgres just to remember which jobs I&rsquo;d already seen. n8n&rsquo;s <strong>Data Tables</strong>
handle it natively: a <code>seen_jobs</code> table, an <code>external_id</code> namespaced <code>{ats}:{company}:{id}</code>, and the
<code>rowNotExists</code> operation drops anything already recorded. State lives inside n8n, backed up with it.
Zero extra infrastructure.</p>
<p>The ordering matters: <strong>notify first, mark seen second.</strong> The insert only happens after the email
sends, so a failed send retries next run instead of silently swallowing a posting.</p>
<h2 id="the-location-filter-is-a-trap">The location filter is a trap</h2>
<p>My first version kept everything that wasn&rsquo;t explicitly US-based. The inbox filled with <em>&ldquo;Senior
Platform Engineer — Spain (Remote)&rdquo;</em> and <em>&quot;… — United Kingdom (Remote)&quot;</em>. Those aren&rsquo;t remote-for-me
— they&rsquo;re remote <em>if you live in Spain</em>. Useless from where I sit.</p>
<p>The fix was to invert the logic. Keep only three things:</p>
<ul>
<li>globally-remote / worldwide / anywhere,</li>
<li>pan-EU (EMEA / Europe / EU / EEA),</li>
<li>my own country.</li>
</ul>
<p>…and <strong>drop single-country remote</strong>, even EU ones. Region and home matches win over the country
deny-list, ambiguous locations are kept (a missed match is worse than one extra line to skim). That
one change cut the noise more than anything else.</p>
<h2 id="let-an-llm-read-the-actual-job">Let an LLM read the actual job</h2>
<p>Keyword + location filtering gets you a candidate list, but it can&rsquo;t tell a &ldquo;Platform Engineer&rdquo; who
herds Kubernetes from a &ldquo;Platform Engineer&rdquo; who owns a Figma design system. The job description can.</p>
<p>So the last step scores each new posting against my CV. My first version batched all of them into
<strong>one</strong> big LLM call — which promptly timed out on the free tier. The fix was the opposite: <strong>one
small call per job</strong>, which also means a single slow or rate-limited job never sinks the batch. Each
call asks a <a href="https://build.nvidia.com">NVIDIA NIM</a> model (Llama 3.1 8B, OpenAI-compatible) for one
number and a reason:</p>
<blockquote>
<p>Score this job 0–100 for fit against my profile. Return <code>{score, reason}</code>.</p>
</blockquote>
<p>That score is what lets me <strong>widen the net instead of narrowing it.</strong> On top of the curated company
list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location
filters do the first pass, then I <strong>only email the roles scoring 80%+.</strong> Casting wide is fine when a
model is the bar at the door. A line ends up looking like:</p>
<blockquote>
<p><strong>92%</strong> — <em>Grafana Labs</em> — Senior Platform Engineer (Remote, EMEA) — <em>strong k8s/GitOps overlap</em> — link</p>
</blockquote>
<p>Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen
either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.</p>
<h2 id="the-unglamorous-bits-that-make-it-trustworthy">The unglamorous bits that make it trustworthy</h2>
<ul>
<li><strong>One bad source can&rsquo;t kill the run</strong> — every fetch is wrapped; failures become a <code>⚠️ N sources failing</code> footer so a company quietly changing ATS is visible, not invisible.</li>
<li><strong>A prime run</strong> seeds the table silently the first time, so I&rsquo;m not buried under every currently-open
role on day one.</li>
<li><strong>Everything tunable lives in one Config node</strong> — companies, keywords, location lists, the profile,
the model — so adding a company is a one-line edit, not a graph safari.</li>
</ul>
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>The &ldquo;scrape job boards&rdquo; problem mostly isn&rsquo;t a scraping problem — it&rsquo;s three public APIs and a
normalizer.</li>
<li>For personal automation, reach for the boring-but-correct primitive: native dedup state beats a
database you have to operate.</li>
<li>An LLM works best here as the <strong>bar at the door</strong>: cheap deterministic filters keep the candidate
set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net
without drowning in it.</li>
</ul>
<p>Workflow JSON, the full node-by-node breakdown, and setup notes:
<strong><a href="https://github.com/janos-gyorgy/ats-job-poller">github.com/janos-gyorgy/ats-job-poller</a></strong>.</p>
]]></content:encoded></item><item><title>🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.</title><link>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</link><pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</guid><description>I built an AI agent in self-hosted n8n over my kombucha-tracking app, then gave it two brains — NVIDIA&amp;rsquo;s 70B and a local Phi-3.5 — sharing the same tools. The cloud model called the tools and answered from real data. The local one couldn&amp;rsquo;t, so it made things up.</description><content:encoded><![CDATA[<h2 id="the-question">The question</h2>
<p>I run <a href="https://n8n.io">n8n</a> on my k3s homelab. Not docker-compose on a NUC — the full treatment: GitOps-reconciled, Vault-backed secrets, default-deny networking. The same boring platform everything else here runs on.</p>
<p>But &ldquo;I have n8n running&rdquo; proves nothing. I wanted to know if I actually understood it as an <em>agent platform</em>, and to answer a question I kept dodging: <strong>for agent work, do I need a cloud model, or is my local one good enough?</strong></p>
<p>So I built a real agent and gave it two brains.</p>
<h2 id="what-i-built">What I built</h2>
<p>A chat assistant over brew-buddy, my homemade kombucha-tracking app (React + a small API + Postgres). You ask it things in plain language; it calls the app&rsquo;s API and answers. The twist: the same question runs through <strong>two agents in parallel</strong> — one backed by NVIDIA&rsquo;s hosted <strong>Llama-3.3-70B</strong>, one by a local <strong>Phi-3.5-mini</strong> on CPU — and the workflow prints both answers side by side.</p>
<pre tabindex="0"><code>Chat ──▶ Agent (cloud: NVIDIA 70B) ──┐   tools (shared):
     └─▶ Agent (local: Phi-3.5)   ──┤     • get_all_batches
                                    │     • get_batch_detail
                                    │     • brewing_statistics
            (Merge) ──▶ both replies, labeled     • add_batch_log   ⟵ write
                                                  • create_batch    ⟵ write
</code></pre><p>Both agents share the same read tools. The two <em>write</em> tools are wired to the cloud agent only — more on that below.</p>
<p><img alt="The kombucha agent in n8n: a chat trigger fans out to two AI Agent nodes (cloud and local), both wired to the same brew-buddy tools, then merged so the two answers print side by side." loading="lazy" src="/posts/n8n-agent-cloud-vs-local/n8n.png"></p>
<p>The nice part: I didn&rsquo;t write a line of glue. n8n&rsquo;s stock <strong>OpenAI Chat Model</strong> node talks to anything OpenAI-compatible if you override the credential&rsquo;s Base URL — so one node points at <code>https://integrate.api.nvidia.com/v1</code>, the other at <code>http://llama-server.&lt;ns&gt;.svc:8080/v1</code> for the local server. Same node, two endpoints.</p>
<h2 id="the-infra-that-keeps-it-honest">The infra that keeps it honest</h2>
<p>I won&rsquo;t re-explain the platform here — it&rsquo;s in earlier posts: <a href="/posts/homelab-gitops/">GitOps</a>, <a href="/posts/k8s-gitops-secrets/">Vault-backed secrets</a>, <a href="/posts/k8s-network-isolation/">default-deny networking</a>, <a href="/posts/homelab-dual-path-tls/">dual-path TLS ingress</a>. But building the agent made one of them <em>tangible</em>.</p>
<p>n8n is, by design, a thing that makes arbitrary HTTP calls on a schedule. That&rsquo;s exactly what you want behind a default-deny network policy. n8n couldn&rsquo;t reach the brew-buddy API at all until I declared it — one line:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># n8n&#39;s namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">allowEgressToNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">web-ai-engine, web-brew-buddy]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c">#                                          ^ added this for the agent</span><span class="w">
</span></span></span></code></pre></div><p>(plus a matching ingress-allow on brew-buddy&rsquo;s side). That&rsquo;s the posture working as intended: the blast radius of a workflow tool is whatever I&rsquo;ve explicitly granted, and not one namespace more. Adding a capability is a reviewable one-liner in Git; Argo reconciles it. No <code>kubectl</code>, no guessing what n8n can reach.</p>
<h2 id="the-ab-same-agent-same-tools-two-brains">The A/B: same agent, same tools, two brains</h2>
<p><strong>Plain &ldquo;hi&rdquo;.</strong> Cloud answers in ~0.5s. Local takes noticeably longer — because even for &ldquo;hi&rdquo;, the agent feeds the model the full system prompt <em>plus the JSON schemas for every tool</em>, and Phi-3.5 has to chew through all of it on CPU before it can say a word. So far, the boring expected result: local is slower.</p>
<p>Then I asked a real question, and the result flipped in a way I didn&rsquo;t expect.</p>
<p><strong>&ldquo;What batches do I have?&rdquo;</strong></p>
<p>Cloud (70B) called <code>get_all_batches</code>, got the real rows, and answered:</p>
<blockquote>
<p>You have two batches: 2026-04-09-A (cold-crash, 3L) and 2026-04-09-W (cold-crash, 3L).</p>
</blockquote>
<p>Local (Phi-3.5) <strong>never called the tool.</strong> It didn&rsquo;t seem to realise it <em>had</em> tools. Instead it confidently explained how <em>I</em> could go find the data myself:</p>
<blockquote>
<p>To list all batches: 1. Access the brew-buddy app. 2. Look for a button labeled &ldquo;List Batches&rdquo;… <code>def get_all_batches(): …</code> … Remember, I&rsquo;m unable to directly interact with apps or databases.</p>
</blockquote>
<p>Fake instructions. Fake code. A polite apology. Everything except the actual answer it was sitting on top of.</p>
<p><strong>Writing data.</strong> I asked both to <em>log</em> an observation. Cloud called <code>add_batch_log</code> and wrote a real row to Postgres (&ldquo;I have recorded the observation…&rdquo;). Local bluffed again — &ldquo;here&rsquo;s how <em>you</em> can log it yourself.&rdquo;</p>
<h2 id="why-it-matters-capability-not-latency">Why it matters: capability, not latency</h2>
<p>The interesting finding isn&rsquo;t &ldquo;the big model is better.&rdquo; It&rsquo;s <em>how</em> the small one fails.</p>
<p>With a ~3.8B model on CPU, the bottleneck for agent work isn&rsquo;t speed — it&rsquo;s <strong>capability</strong>. Phi-3.5 couldn&rsquo;t reliably emit tool calls, so n8n&rsquo;s tools never fired, and the model degraded into a chatbot that <strong>hallucinates a plausible answer instead of fetching the real one.</strong> That failure mode is worse than an error: an error you catch, a confident wrong answer you ship.</p>
<p>A couple of measurements that sharpened it:</p>
<ul>
<li>NVIDIA 70B, <strong>plain chat</strong>: ~0.5s.</li>
<li>NVIDIA 70B, <strong>function-calling</strong> (with tool schemas): ~8.6s per round-trip — and an agent makes several round-trips per answer. That&rsquo;s real latency you have to budget a timeout for. (It&rsquo;s also why the cloud side initially <em>timed out</em> in n8n until I raised the model node&rsquo;s timeout — the model was fine, n8n was cutting it off.)</li>
</ul>
<p>So the snappy-vs-slow comparison <strong>flips depending on whether the question triggers tools</strong>. Plain chat: cloud wins on speed. Tool use: the local model is &ldquo;fast&rdquo; only because it skips the tools and makes something up. Speed was never the real axis.</p>
<p>The honest caveat: this is <em>this</em> small general model in a multi-tool agent loop. Purpose-built small models with tool-calling fine-tunes do better at narrow tasks — I run a 1.7B one elsewhere that emits a single structured tool call just fine. But for &ldquo;pick the right tool from several and chain them,&rdquo; 70B was in a different league.</p>
<h2 id="the-trust-boundary">The trust boundary</h2>
<p>I gave the write tools (<code>add_batch_log</code>, <code>create_batch</code>) to the cloud agent <strong>only</strong>. The local agent is read-only — not by instruction, by wiring. Even if Phi-3.5 <em>did</em> decide to call a write tool, the connection isn&rsquo;t there. The reliable model is the only one allowed to mutate real data, and that&rsquo;s enforced structurally, not by trusting a prompt.</p>
<h2 id="whats-toy-and-whats-real">What&rsquo;s toy and what&rsquo;s real</h2>
<p>Worth being straight: this is a <strong>single-node homelab</strong>. The agent and both model paths share one box. Running n8n on Kubernetes and swapping models isn&rsquo;t novel — <a href="https://docs.n8n.io/hosting/scaling/queue-mode/">n8n&rsquo;s own docs</a> cover queue mode, where a main instance fans work out to a pool of worker pods you scale horizontally, with external Postgres for state. That&rsquo;s the real production shape. Mine is one replica with an emptyDir&rsquo;s worth of ambition.</p>
<p>What I think <em>is</em> worth sharing is the finding (the capability cliff, and that its failure mode is confident fabrication) and the boring thing underneath it: because the platform is default-deny and GitOps-reconciled, running this experiment cost me one reviewable egress line and zero risk to anything else.</p>
<h2 id="the-boring-part-is-the-point">The boring part is the point</h2>
<p>The AI was the fun bit. But the reason I could bolt an agent onto a live cluster, point it at a real app, give it write access to one model and not the other, and tear it all down again — without worrying what it might touch — is that the infrastructure was already boring. Default-deny. Secrets out of Git. <code>git push</code>, Argo reconciles.</p>
<p>The model picks the tools. The platform decides what the tools can reach. Keep those two honest about each other and self-hosting an agent stops being scary and starts being just another app.</p>
]]></content:encoded></item><item><title>🔒 Building a PII Guardrail Proxy for Cloud LLM Calls</title><link>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</link><pubDate>Fri, 26 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</guid><description>A local model classifies every prompt before it leaves the cluster. If it&amp;rsquo;s sensitive, it&amp;rsquo;s blocked. If it&amp;rsquo;s clean, it goes to NVIDIA NIM. 150 lines of FastAPI, deployed on k3s.</description><content:encoded><![CDATA[<h2 id="the-problem-with-cloud-llm-access">The problem with cloud LLM access</h2>
<p>Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.</p>
<p>The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone&rsquo;s personal data into the chat? It leaves the cluster. It&rsquo;s logged somewhere you don&rsquo;t control.</p>
<p>The standard answer is &ldquo;train your users.&rdquo; I&rsquo;d rather have a technical control.</p>
<h2 id="the-architecture">The architecture</h2>
<pre tabindex="0"><code>Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain
</code></pre><p>Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:</p>
<ul>
<li><code>SAFE</code> → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)</li>
<li><code>SENSITIVE: &lt;reason&gt;</code> → return HTTP 400, log the block, nothing leaves the cluster</li>
</ul>
<p>The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The proxy is ~150 lines of FastAPI. The classifier call:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">CLASSIFIER_PROMPT</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;You are a data security classifier. Check if the text below contains sensitive information:
</span></span></span><span class="line"><span class="cl"><span class="s2">passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Reply with ONLY one of:
</span></span></span><span class="line"><span class="cl"><span class="s2">SAFE
</span></span></span><span class="line"><span class="cl"><span class="s2">SENSITIVE: &lt;one-line reason&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Text to check:
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">LLAMA_BASE</span><span class="si">}</span><span class="s2">/chat/completions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;phi-3.5-mini&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">CLASSIFIER_PROMPT</span> <span class="o">+</span> <span class="n">text</span><span class="p">[:</span><span class="mi">3000</span><span class="p">]}],</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;max_tokens&#34;</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Authorization&#34;</span><span class="p">:</span> <span class="s2">&#34;Bearer sk-no-key&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">answer</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">answer</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&#34;SENSITIVE&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">reason</span> <span class="o">=</span> <span class="n">answer</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;:&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="s2">&#34;:&#34;</span> <span class="ow">in</span> <span class="n">answer</span> <span class="k">else</span> <span class="s2">&#34;sensitive content detected&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span><span class="p">,</span> <span class="n">reason</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span><span class="p">,</span> <span class="s2">&#34;&#34;</span>
</span></span></code></pre></div><p><code>temperature=0</code> and <code>max_tokens=30</code> keep the response deterministic and fast. The model only needs to output one word or one line.</p>
<p>The main handler:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@app.post</span><span class="p">(</span><span class="s2">&#34;/v1/chat/completions&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">proxy_chat</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span> <span class="o">=</span> <span class="k">await</span> <span class="n">request</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">user_text</span> <span class="o">=</span> <span class="n">extract_user_text</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;messages&#34;</span><span class="p">,</span> <span class="p">[]))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">user_text</span><span class="o">.</span><span class="n">strip</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span><span class="p">,</span> <span class="n">reason</span> <span class="o">=</span> <span class="k">await</span> <span class="n">classify</span><span class="p">(</span><span class="n">user_text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">&#34;classifier error: </span><span class="si">%s</span><span class="s2"> — allowing request through&#34;</span><span class="p">,</span> <span class="n">exc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">is_sensitive</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;error&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;message&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;Request blocked by ai-guard: </span><span class="si">{</span><span class="n">reason</span><span class="si">}</span><span class="s2">. Remove sensitive content before sending to external models.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;content_policy_violation&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Safe — forward to upstream with streaming support</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span></code></pre></div><p>Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I&rsquo;d rather not block all cloud LLM access because the local model is warming up.</p>
<h2 id="kubernetes-deployment">Kubernetes deployment</h2>
<p>ai-guard runs in the same namespace as llama-server and Open WebUI (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed in Cilium, so no new network policy needed.</p>
<p>Open WebUI uses semicolon-separated lists for multiple API backends:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_BASE_URLS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_KEYS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;sk-no-key;sk-no-key&#34;</span><span class="w">
</span></span></span></code></pre></div><p>The second entry is ai-guard. Open WebUI passes <code>sk-no-key</code> as the API key — ai-guard ignores it and uses its own <code>UPSTREAM_API_KEY</code> from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.</p>
<h2 id="the-latency-tradeoff">The latency tradeoff</h2>
<p>The classification step adds 5–15 seconds on CPU inference. That&rsquo;s the cost of keeping the check fully private — the classifier never sends data anywhere.</p>
<p>For a personal homelab assistant, this is fine. For a high-throughput production setup, you&rsquo;d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.</p>
<h2 id="what-it-catches">What it catches</h2>
<p>The classifier prompt targets:</p>
<ul>
<li>Passwords, API keys, tokens, credentials</li>
<li>PII: names, emails, phone numbers, SSNs, addresses</li>
<li>Financial data: card numbers, bank accounts</li>
<li>Private keys</li>
</ul>
<p>False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/ai-guard">github.com/janos-gyorgy/ai-guard</a> — MIT licensed, Kubernetes manifests included.</p>
]]></content:encoded></item><item><title>🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send</title><link>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</guid><description>Replace PII with semantically realistic fakes before sending to a cloud LLM, then restore the originals from the response. Started with a general model and prompt engineering — then upgraded to a purpose-built 1.7B fine-tune via Ollama.</description><content:encoded><![CDATA[<h2 id="the-problem-with-blocking">The problem with blocking</h2>
<p>The <a href="/posts/ai-pii-guardrail-proxy/">PII guardrail proxy I built last week</a> works by classifying prompts and blocking the sensitive ones. That&rsquo;s fine for a chat interface where a human can rephrase. It doesn&rsquo;t work for automated pipelines.</p>
<p>If a Jira ticket contains someone&rsquo;s name and an internal hostname, you don&rsquo;t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.</p>
<h2 id="the-pattern">The pattern</h2>
<pre tabindex="0"><code>Input text
  → anonymizer: extract PII, replace with semantic fakes
  → &#34;Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net&#34;
  + mapping: {&#34;Nathan Chen&#34; → &#34;John Smith&#34;, &#34;DataSoft LLC&#34; → &#34;ACME&#34;, ...}
  → cloud LLM: processes coherent text, never sees real values
  → &#34;Nathan Chen should check the ProjectX docs with the DataSoft LLC team&#34;
  → string substitution with reverse mapping
  → &#34;John Smith should check the OAuth docs with the ACME team&#34;
</code></pre><p>Two things that make this work:</p>
<p><strong>Deanonymization needs no LLM.</strong> Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.</p>
<p><strong>Semantic fakes beat placeholder tokens.</strong> An earlier version of this used <code>[PERSON_1]</code>, <code>[ORG_1]</code> tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees <code>Nathan Chen from DataSoft LLC</code>, it treats it as real text and responds naturally. Quality is noticeably better.</p>
<h2 id="prior-art--what-already-exists">Prior art — what already exists</h2>
<p>This is a well-established pattern. Worth knowing what&rsquo;s out there:</p>
<p><strong><a href="https://llm-guard.com/output_scanners/deanonymize/">LLM Guard</a></strong> (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you&rsquo;re building this for anything serious.</p>
<p><strong><a href="https://techcommunity.microsoft.com/blog/azuredevcommunityblog/introducing-pii-shield-a-privacy-proxy-for-every-llm-call/4514726">Microsoft PII Shield</a></strong> — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.</p>
<p><strong><a href="https://github.com/fsndzomga/anonLLM">anonLLM</a></strong> — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.</p>
<p><strong><a href="https://ieeexplore.ieee.org/document/11140717/">REDACT</a></strong> — IEEE paper describing a system using Ollama for PII redaction in documents.</p>
<p><strong><a href="https://huggingface.co/blog/pratyushrt/anonymizerslm">HuggingFace Anonymizer SLM series</a></strong> — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1&rsquo;s 9.77.</p>
<p>That last one is what this implementation actually uses.</p>
<h2 id="the-model-anonymizer-17b">The model: Anonymizer-1.7B</h2>
<p><a href="https://huggingface.co/eternisai/Anonymizer-1.7B">eternisai/Anonymizer-1.7B</a> is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;arguments&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;John Smith&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;Nathan Chen&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;ACME Corp&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;DataSoft LLC&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;auth.acme.internal&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;dev.internal.net&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>No prompt engineering needed. The model knows exactly what it&rsquo;s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.</p>
<p>The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: <code>hf.co/gabriellarson/Anonymizer-1.7B-GGUF</code>.</p>
<h2 id="the-implementation">The implementation</h2>
<p><code>llm-anonymizer</code> is a FastAPI service with two endpoints.</p>
<p><strong><code>POST /anonymize</code></strong> — calls Ollama with the tool definition, parses the response:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">TOOLS</span> <span class="o">=</span> <span class="p">[{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Replace PII entities with anonymized versions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;array&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;items&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;original&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;replacement&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">,</span> <span class="s2">&#34;replacement&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                    <span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;replacements&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl"><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">OLLAMA_BASE</span><span class="si">}</span><span class="s2">/api/chat&#34;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">text</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">/no_think&#34;</span><span class="p">},</span>  <span class="c1"># skip Qwen3 thinking mode</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;tools&#34;</span><span class="p">:</span> <span class="n">TOOLS</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tool_calls</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;tool_calls&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">replacements</span> <span class="o">=</span> <span class="n">tool_calls</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;function&#34;</span><span class="p">][</span><span class="s2">&#34;arguments&#34;</span><span class="p">][</span><span class="s2">&#34;replacements&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build reverse mapping: replacement → original (for deanonymization)</span>
</span></span><span class="line"><span class="cl"><span class="n">anonymized</span> <span class="o">=</span> <span class="n">text</span>
</span></span><span class="line"><span class="cl"><span class="n">mapping</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">replacements</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">anonymized</span> <span class="o">=</span> <span class="n">anonymized</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">],</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">mapping</span><span class="p">[</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">]]</span> <span class="o">=</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">]</span>
</span></span></code></pre></div><p>The <code>/no_think</code> suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.</p>
<p><strong><code>POST /deanonymize</code></strong> — no model call, just substitution:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">replacement</span><span class="p">,</span> <span class="n">original</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">mapping</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">replacement</span><span class="p">,</span> <span class="n">original</span><span class="p">)</span>
</span></span></code></pre></div><p>Sorted by length descending so longer tokens don&rsquo;t get partially overwritten by shorter ones.</p>
<h2 id="the-kubernetes-stack">The Kubernetes stack</h2>
<p>Ollama runs as a separate deployment in the same namespace as everything else (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed — no new network policies.</p>
<pre tabindex="0"><code>llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF
</code></pre><p>One-time model pull after first deploy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl <span class="nb">exec</span> -n web-ai-engine deploy/ollama -- <span class="se">\
</span></span></span><span class="line"><span class="cl">  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF
</span></span></code></pre></div><p>Ollama caches it on a 10Gi PVC, so pod restarts don&rsquo;t re-download.</p>
<h2 id="the-n8n-pipeline">The n8n pipeline</h2>
<p>Five-node chain triggered by webhook:</p>
<pre tabindex="0"><code>Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond
</code></pre><p>The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.</p>
<p>Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.</p>
<h2 id="the-caveats">The caveats</h2>
<p><strong>1.7B isn&rsquo;t GPT-4.1.</strong> The model scores 9.20/10 on the benchmark — which means roughly 1 in 10 cases has a missed or incorrect entity. Test with real examples from your domain before depending on it.</p>
<p><strong>Deanonymization breaks on heavy rephrasing.</strong> If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn&rsquo;t eliminate the risk.</p>
<p><strong>Ollama adds a deployment.</strong> It&rsquo;s ~500MB image + the model weights (~1GB Q4). On a constrained single-node cluster that&rsquo;s real overhead. llama-server already covers general chat; Ollama is purely for this model&rsquo;s tool-calling support.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/llm-anonymizer">github.com/janos-gyorgy/llm-anonymizer</a> — MIT licensed, Kubernetes manifests and n8n workflow included.</p>
]]></content:encoded></item><item><title>📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics</title><link>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</guid><description>llama.cpp&amp;rsquo;s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.</description><content:encoded><![CDATA[<h2 id="what-operating-an-llm-actually-means">What &ldquo;operating an LLM&rdquo; actually means</h2>
<p>Running a local model is easy. Understanding what it&rsquo;s doing is less so.</p>
<p>After deploying llama.cpp + Open WebUI on k3s (<a href="/posts/local-llm-k8s-no-gpu/">previous post</a>), I had a chat interface backed by a local model. What I didn&rsquo;t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.</p>
<p>The instinct for this kind of problem is usually &ldquo;add a proxy layer.&rdquo; There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.</p>
<p>The thing I&rsquo;d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.</p>
<hr>
<h2 id="--metrics"><code>--metrics</code></h2>
<p>One additional argument to the inference server:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- -<span class="l">m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">/models/Phi-3.5-mini-instruct-Q4_K_M.gguf</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">host</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;0.0.0.0&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">port</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;8080&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">ctx-size</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;4096&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="kc">n</span>-<span class="l">predict</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1024&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">parallel</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">metrics       </span><span class="w"> </span><span class="c"># ← this</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">log-disable</span><span class="w">
</span></span></span></code></pre></div><p>After restart, <code>GET /metrics</code> on port 8080 returns valid Prometheus exposition format:</p>
<pre tabindex="0"><code># HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
</code></pre><p>The full set of metrics:</p>
<table>
	<thead>
			<tr>
					<th>Metric</th>
					<th>Type</th>
					<th>What it measures</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>llamacpp:prompt_tokens_total</code></td>
					<td>counter</td>
					<td>Input tokens processed (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_total</code></td>
					<td>counter</td>
					<td>Output tokens generated (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current prompt throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:predicted_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current generation throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent generating</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent on prompts</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_processing</code></td>
					<td>gauge</td>
					<td>Requests currently being processed</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_deferred</code></td>
					<td>gauge</td>
					<td>Requests queued, waiting for a slot</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_decode_total</code></td>
					<td>counter</td>
					<td>Total llama_decode() calls</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_busy_slots_per_decode</code></td>
					<td>counter</td>
					<td>Slots active per decode call</td>
			</tr>
	</tbody>
</table>
<p>These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.</p>
<hr>
<h2 id="prometheus-scrape-config">Prometheus scrape config</h2>
<p>Adding a static scrape target in the existing Prometheus configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">extraScrapeConfigs</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">  - job_name: llama-server
</span></span></span><span class="line"><span class="cl"><span class="sd">    static_configs:
</span></span></span><span class="line"><span class="cl"><span class="sd">      - targets:
</span></span></span><span class="line"><span class="cl"><span class="sd">          - llama-server.web-ai-engine.svc:8080
</span></span></span><span class="line"><span class="cl"><span class="sd">    metrics_path: /metrics</span><span class="w">
</span></span></span></code></pre></div><p>The only non-obvious thing here is the network policy: Prometheus lives in <code>dashboard-homelab</code>, and llama-server lives in <code>web-ai-engine</code>. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In <code>applications.yml</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">allowIngressFromNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">dashboard-homelab]</span><span class="w">
</span></span></span></code></pre></div><p>Without this, Prometheus scrape attempts fail silently with a timeout.</p>
<hr>
<h2 id="grafana-dashboard-via-configmap">Grafana dashboard via ConfigMap</h2>
<p>Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label <code>grafana_dashboard: &quot;1&quot;</code> is picked up, loaded, and available in Grafana — across all namespaces by default.</p>
<p>The dashboard ConfigMap lives in <code>web-ai-engine</code>, not <code>dashboard-homelab</code>. The sidecar finds it regardless:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ConfigMap</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">grafana-dashboard-llm</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">grafana_dashboard</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">llm-metrics.json</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">    {
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;title&#34;: &#34;LLM Metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;uid&#34;: &#34;llm-metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      ...
</span></span></span><span class="line"><span class="cl"><span class="sd">    }</span><span class="w">
</span></span></span></code></pre></div><p>Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.</p>
<p>This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app&rsquo;s Kubernetes resources also describes what the monitoring looks like.</p>
<hr>
<h2 id="what-the-dashboard-shows">What the dashboard shows</h2>
<p>After sending a few messages through Open WebUI:</p>
<p><strong>Generation throughput</strong> — the <code>llamacpp:predicted_tokens_seconds</code> gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you&rsquo;re comparing models or quantisation levels.</p>
<p><strong>Cumulative tokens</strong> — <code>llamacpp:prompt_tokens_total</code> and <code>llamacpp:tokens_predicted_total</code> both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it&rsquo;s typically 3:1 prompt to generation; for summarisation tasks it flips.</p>
<p><strong>Queue depth</strong> — <code>llamacpp:requests_deferred</code> is 0 almost always, which is expected with <code>--parallel 1</code>. If it&rsquo;s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.</p>
<p><strong>ms/token</strong> — derived from <code>rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000</code>. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.</p>
<hr>
<h2 id="whats-missing-compared-to-a-proxy-layer">What&rsquo;s missing compared to a proxy layer</h2>
<p>LiteLLM and similar proxies give you things this setup doesn&rsquo;t:</p>
<ul>
<li><strong>Per-model routing</strong> — if you&rsquo;re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.</li>
<li><strong>Virtual API keys</strong> — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.</li>
<li><strong>Spend tracking</strong> — meaningful when you&rsquo;re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.</li>
</ul>
<p>For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.</p>
<hr>
<h2 id="the-pattern">The pattern</h2>
<p>The broader point is that the observable unit here isn&rsquo;t the proxy — it&rsquo;s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it&rsquo;s the right place to measure.</p>
<p>Starter manifests with the metrics configuration included: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
]]></content:encoded></item><item><title>🤖 Local LLM Inference on Kubernetes, No GPU Required</title><link>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</guid><description>A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed as a single Git push.</description><content:encoded><![CDATA[<h2 id="the-gpu-assumption">The GPU assumption</h2>
<p>Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you&rsquo;re wasting your time — inference will be too slow to be useful.</p>
<p>That&rsquo;s not been my experience.</p>
<p>I&rsquo;ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.</p>
<p>The point isn&rsquo;t that CPU inference beats GPU inference. It&rsquo;s that &ldquo;good enough for personal use&rdquo; is a much lower bar than &ldquo;production LLM serving&rdquo;, and the hardware you already have probably clears it.</p>
<hr>
<h2 id="the-stack">The stack</h2>
<p>Two components:</p>
<p><strong>llama.cpp</strong> (<code>ghcr.io/ggml-org/llama.cpp:server</code>) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.</p>
<p><strong>Open WebUI</strong> (<code>ghcr.io/open-webui/open-webui</code>) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.</p>
<p>The architecture is simple on purpose:</p>
<pre tabindex="0"><code>Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models
</code></pre><p>Open WebUI doesn&rsquo;t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn&rsquo;t change. The interface is the standard.</p>
<hr>
<h2 id="model-choice">Model choice</h2>
<p>GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:</p>
<table>
	<thead>
			<tr>
					<th>Model</th>
					<th>Quant</th>
					<th>Size</th>
					<th>RAM at runtime</th>
					<th>Notes</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Llama-3.2-3B</td>
					<td>Q4_K_M</td>
					<td>~2 GB</td>
					<td>~3 GB</td>
					<td>Fastest, lowest quality</td>
			</tr>
			<tr>
					<td>Phi-3.5-mini</td>
					<td>Q4_K_M</td>
					<td>~2.4 GB</td>
					<td>~3–4 GB</td>
					<td>Good balance — what I use</td>
			</tr>
			<tr>
					<td>Mistral-7B-Instruct</td>
					<td>Q4_K_M</td>
					<td>~4.1 GB</td>
					<td>~5–6 GB</td>
					<td>Noticeably better, needs more RAM</td>
			</tr>
			<tr>
					<td>Llama-3.1-8B</td>
					<td>Q4_K_M</td>
					<td>~4.7 GB</td>
					<td>~6–8 GB</td>
					<td>High quality, stretches 16 GB with other workloads</td>
			</tr>
	</tbody>
</table>
<p>On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it&rsquo;s tighter.</p>
<p>Models live in <code>/srv/ai-models</code> on the node, mounted into the pod as a <code>hostPath</code> volume. Single-node homelab, so there&rsquo;s no scheduling concern. Download once with <code>wget</code>, done.</p>
<hr>
<h2 id="key-configuration-choices">Key configuration choices</h2>
<p><strong>Context size (<code>--ctx-size 4096</code>):</strong> How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you&rsquo;re summarising long documents, bump to 8192 and watch your RAM usage.</p>
<p><strong>Max output tokens (<code>--n-predict 1024</code>):</strong> Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.</p>
<p><strong>Parallel slots (<code>--parallel 1</code>):</strong> How many concurrent inference requests the server handles. On CPU there&rsquo;s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.</p>
<p><strong>Memory limits:</strong> Set the container limit to roughly 2× the model&rsquo;s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1Gi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">6Gi</span><span class="w">
</span></span></span></code></pre></div><p>No CPU limit. llama-server will use however many cores are available during inference — that&rsquo;s what makes it usable. A CPU limit would throttle inference to unusable speeds.</p>
<hr>
<h2 id="deployment-as-a-gitops-push">Deployment as a GitOps push</h2>
<p>The whole stack lives in one YAML values file, deployed through the <a href="https://github.com/janos-gyorgy/gitops-extra-objects-chart">extra-objects chart</a> that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.</p>
<p>Nothing was <code>kubectl apply</code>-ed. The deployment happened by pushing to Git.</p>
<p>What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no <code>kubectl</code>. The same process I use for any other service in the cluster.</p>
<p>The namespace, network policies, service account, and RBAC all generate from a single entry in <code>applications.yml</code> — same as every other app. The AI inference stack isn&rsquo;t special from an operations perspective.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># applications.yml excerpt</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><hr>
<h2 id="access-and-auth">Access and auth</h2>
<p>The service is exposed at <code>ai.hippotion.com</code> through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let&rsquo;s Encrypt cert. See <a href="/posts/homelab-dual-path-tls/">that post</a> for the full explanation.</p>
<p>Auth is handled by Traefik&rsquo;s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI&rsquo;s own auth is disabled (<code>WEBUI_AUTH: false</code>) — the OAuth layer upstream handles it. One login covers every service in the cluster.</p>
<p>The <code>WEBUI_SECRET_KEY</code> (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.</p>
<hr>
<h2 id="what-the-day-to-day-is-actually-like">What the day-to-day is actually like</h2>
<p>Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you&rsquo;re reading what came before while it generates, that&rsquo;s fine. For quick factual lookups, it&rsquo;s a little tedious.</p>
<p>The useful cases for a local model, for me:</p>
<ul>
<li><strong>Rephrasing or editing text</strong> — paste something, ask it to tighten it. No data leaves the house.</li>
<li><strong>Config explanation</strong> — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.</li>
<li><strong>Quick summaries</strong> — short documents, log snippets, error messages.</li>
<li><strong>Experimentation</strong> — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.</li>
</ul>
<p>For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I&rsquo;m iterating and don&rsquo;t want to pay per token.</p>
<hr>
<h2 id="the-starting-point-if-you-want-to-try-it">The starting point if you want to try it</h2>
<p>The manifests are on GitHub: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
<p>It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.</p>
<p>No GPU required. The ThinkCentre in the corner of my desk does the job.</p>
]]></content:encoded></item></channel></rss>