LLM on hippotion

VoteWatch: How Your Representatives Voted — and Whether You'd Agree

Fri, 15 May 2026 00:00:00 +0000

Open data nobody opens

Every vote in the European Parliament and the Slovak National Council is public. The EU even ships it as a clean API. And almost nobody reads it, because the raw record is unreadable: “Návrh poslanca… ktorým sa dopĺňa zákon č. 581/2004 Z. z. … (tlač 1259) — tretie čítanie, hlasovanie o návrhu zákona ako o celku.” (Roughly: “An MP’s proposal… supplementing Act No. 581/2004 Coll. … (print 1259), third reading, vote on the bill as a whole.”) Multiply that by a few hundred votes a sitting. Transparency that no human can parse is transparency on paper only.

So I built VoteWatch — a small site on my homelab that turns the record into something a citizen can actually use: what was decided, who voted, and do you agree?

VoteWatch SK: each decision summarised in plain language, which parties voted how, and a Yes/No question whose live citizen tally sits next to how parliament actually voted — labelled agree or gap.

Two halves, one lopsided

The EU half was easy. HowTheyVote.eu already did the hard work and publishes roll-call votes as a clean, open-licensed API. You consume it; you don’t scrape it.

The Slovak half is where the real work lives, and the real value. nrsr.sk has no API. The HTML is the contract: a results listing, and per-vote pages where each MP appears next to a one-letter code ([Z] za, "for"; [P] proti, "against"; [?] zdržal sa, "abstained"). So the national half is a genuine scraper, the unglamorous kind that nobody maintains, which is exactly why a gap exists to fill. The unglamorous part is the moat.
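A minimal sketch of the per-vote parsing step, assuming the page has already been reduced to text lines of the form `[Z] Name`. The real nrsr.sk markup is not shown in this post, so the line format, the `parse_vote_lines` helper and the sample names are all illustrative assumptions:

```python
import re
from collections import Counter

# Codes named in the text above; any other code is kept under its raw letter.
CODE_LABELS = {"Z": "for", "P": "against", "?": "abstained"}

# Assumed line shape: a one-character code in brackets, then the MP's name.
LINE_RE = re.compile(r"^\[(?P<code>.)\]\s+(?P<name>.+?)\s*$")


def parse_vote_lines(lines):
    """Return (mp_name, position) pairs from '[X] Name' lines."""
    votes = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines, anything unrecognised
        code = match.group("code")
        votes.append((match.group("name"), CODE_LABELS.get(code, code)))
    return votes


sample = [
    "[Z] Example MP One",
    "[P] Example MP Two",
    "[?] Example MP Three",
    "Some club heading",
]
parsed = parse_vote_lines(sample)
tally = Counter(position for _, position in parsed)
```

Unknown codes pass through unchanged rather than being dropped, so a new letter on the page shows up in the tally instead of silently vanishing.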

From ten votes to one question

A single bill generates a pile of procedural roll-calls — shorten the debate, move to third reading, amendment block A, amendment block B, the bill as a whole. Ten rows that are really one decision. Nobody wants ten rows.

So the pipeline groups votes by bill, then asks an LLM (llama-3.3-70b on NVIDIA NIM) to do exactly one job: turn the bureaucratic titles into a plain headline, two sentences of summary, and one neutral Yes/No question a person can actually answer. Seven votes on the health-insurer bill collapse into: “Changes to the health-insurance law” → “Do you agree with the health-insurance bill?”

The rule that keeps it honest

Here’s the line I won’t cross, and it’s the whole reason I trust the result: the AI writes the prose, but it never decides a fact.

Which votes belong to one bill? Deterministic — parsed from the bill number.
Did it pass? Deterministic — read from the result row.
Which parties voted for, against, abstained? Deterministic — tallied from the per-MP record, shown as Za: SMER-SD, HLAS-SD, SNS · Zdržali sa: PS, KDH, SaS.
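The deterministic steps above can be sketched roughly as follows. This assumes the bill number appears in the title as `tlač NNNN` (as in the example quoted earlier) and that a party's position is whatever most of its MPs voted; that majority rule, the record shapes and the function names are assumptions, not the site's actual code:

```python
import re
from collections import defaultdict

PRINT_RE = re.compile(r"tlač\s+(\d+)")


def bill_key(title):
    """Extract the parliamentary print number from a raw vote title."""
    match = PRINT_RE.search(title)
    return match.group(1) if match else None


def group_by_bill(votes):
    """Group vote records (dicts with a 'title') by print number."""
    groups = defaultdict(list)
    for vote in votes:
        groups[bill_key(vote["title"])].append(vote)
    return dict(groups)


def party_positions(mp_votes):
    """Map each position to the parties whose MPs mostly took it."""
    counts = defaultdict(lambda: defaultdict(int))
    for party, position in mp_votes:
        counts[party][position] += 1
    result = defaultdict(list)
    for party, by_position in counts.items():
        top = max(by_position, key=by_position.get)
        result[top].append(party)
    return {pos: sorted(parties) for pos, parties in result.items()}


votes = [
    {"title": "Shorten the debate (tlač 1259)"},
    {"title": "Bill as a whole (tlač 1259)"},
    {"title": "Unrelated bill (tlač 1301)"},
]
groups = group_by_bill(votes)
positions = party_positions([
    ("SMER-SD", "for"), ("SMER-SD", "for"),
    ("PS", "abstained"), ("PS", "for"), ("PS", "abstained"),
])
```

No model is involved anywhere in this path, which is the point: the grouping and the tallies can be re-derived from the record by anyone.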

The model only touches language: the headline, the summary, the question. If it hallucinates, you get an awkward sentence — never a wrong vote count. And if the model fails entirely, the card falls back to the raw title. The facts come from the record; the model just makes the record legible. For civic data, that separation isn’t a nice-to-have — it’s the difference between a tool and a liability. (Every card says so out loud: summaries are AI-generated; the raw record prevails.)
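One way to express the fallback, as a sketch: the card takes the model's three text fields only when all of them are present and non-empty, and otherwise shows the raw title. The field names and the `build_card` helper are hypothetical:

```python
REQUIRED = ("headline", "summary", "question")


def build_card(raw_title, llm_fields):
    """Use model-written prose if complete, else fall back to the raw title."""
    usable = isinstance(llm_fields, dict) and all(
        isinstance(llm_fields.get(key), str) and llm_fields[key].strip()
        for key in REQUIRED
    )
    if not usable:
        # Model failed or returned something partial: show the record as-is.
        return {"headline": raw_title, "summary": "", "question": "",
                "ai_generated": False}
    card = {key: llm_fields[key].strip() for key in REQUIRED}
    card["ai_generated"] = True
    return card


ok = build_card("raw title", {
    "headline": "Changes to the health-insurance law",
    "summary": "Two sentences of summary.",
    "question": "Do you agree with the health-insurance bill?",
})
fallback = build_card("raw title", {"headline": "Only a headline"})
```

Vote counts and party positions never pass through this function, so a bad model response can only cost readability.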

The part that closes the loop

Showing people how their representatives voted is only half a feedback loop. The other half is letting them answer.

Each decision carries its one distilled question and two buttons — Áno / Nie. You vote, and the site shows the citizen tally next to how parliament actually decided, with the honest verdict on top: "✓ Citizens and Parliament agree" or "⚖ Gap between citizens and Parliament." That gap is the entire point. It’s the thesis behind a side project of mine called veracracy — governance measured against verified knowledge and the actual will of the governed — made concrete enough to click.

The same loop on the European Parliament — dossiers consolidated, political-group stances (EPP, S&D, PfE…), and the citizen poll under each topic.

The backend is deliberately boring. The site is static (git-synced nginx, same as this blog). Votes can’t POST to a static page, so they go to a public n8n webhook that records to a data table and returns live tallies — no new service, no database, just the automation box I already run. Vote keys are namespaced so EU and Slovak polls share one store without colliding.
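As an illustration of the namespaced keys and the agree/gap label, here is a small in-memory stand-in for the webhook's data table. The key format, the `TallyStore` class and the tie handling are assumptions made for the sketch:

```python
from collections import defaultdict


def vote_key(chamber, decision_id):
    """Namespace poll keys so EU and Slovak decisions never collide."""
    return f"{chamber}:{decision_id}"


class TallyStore:
    """In-memory stand-in for the data table behind the webhook."""

    def __init__(self):
        self._rows = defaultdict(lambda: {"yes": 0, "no": 0})

    def record(self, key, answer):
        if answer not in ("yes", "no"):
            raise ValueError(f"unexpected answer: {answer!r}")
        self._rows[key][answer] += 1
        return dict(self._rows[key])  # live tally returned to the page


def verdict(tally, parliament_passed):
    """Compare the citizen majority with the parliamentary outcome."""
    if tally["yes"] == tally["no"]:
        return "undecided"  # assumption: a tie gets no label
    citizens_say_yes = tally["yes"] > tally["no"]
    return "agree" if citizens_say_yes == parliament_passed else "gap"


store = TallyStore()
key = vote_key("sk", "1259")
store.record(key, "no")
store.record(key, "no")
latest = store.record(key, "yes")
```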

The honest caveat

Dedup is browser-local. It stops casual double-voting, but behind a Cloudflare tunnel every request shares one IP, so this is an indicative signal, not a secured ballot. That’s the right altitude for “let people express an opinion.” The day it needs to mean more than that, it needs real identity first — and I’d rather ship the honest version than fake the robust one.

It’s live at votewatch.hippotion.com — the EU parliament and the Slovak NR SR, every MEP and every poslanec, in plain language, with a button that asks the only question that matters after a vote: would you have voted the same way?

A neutral record — what was decided and who decided it — not a villain list. Data © HowTheyVote.eu (ODbL) and nrsr.sk.

Mind the gap: I pointed monitoring at my own skill set

Fri, 27 Mar 2026 00:00:00 +0000

A while back I applied for a senior platform role at n8n and didn’t land it. Fair enough — but “fair enough” isn’t actionable. Rejections come with no logs, no metrics, no trace. For someone who runs thirty-odd services with full observability, having vibes as the only instrumentation on my own career felt architecturally embarrassing.

So I built mind-the-gap: a pipeline that measures what the market demands, diffs it against what I can prove, and renders the gap as a private dashboard on my cluster. The job hunt is now a monitored system. This post is about the non-obvious decisions.

Demand: an LLM reads job listings so I don’t have to

I already had a job poller — an n8n workflow that polls the public ATS APIs (Greenhouse / Lever / Ashby) of ~33 companies plus a broad remote-jobs feed every six hours. A sibling workflow now re-fetches the same boards and, for every listing that passes the role+location gate, asks a small hosted LLM (Llama-3.1-8B) for a structured extraction:

{"seniority": "senior", "skills": [{"name": "kubernetes", "importance": "must"}, ...]}

One row per (job, skill) lands in an n8n Data Table. Decisions that mattered:

One LLM call per job, not one batch. Free-tier inference times out on batches; per-job calls are slower but fail independently. A lesson the poller already paid for.
Insert doubles as the processed-marker. A job whose extraction fails to parse produces no rows — so it’s retried next run, for free. No status column, no second table.
Canonicalization in code, not in the prompt. The model says “K8s”, “k3s”, “EKS” on different days regardless of instructions. A dumb alias map (k8s→kubernetes, eks→aws) beats prompt engineering for consistency.
8B is good enough — with a guard. It occasionally echoed the seniority enum back literally ("junior|mid|senior|staff|lead|unspecified"). The fix is one line of validation, not a bigger model.
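The last three decisions fit in a few lines. This is a sketch under assumptions: the alias map contains only the two aliases named above, an invalid seniority is coerced to "unspecified" (the post only says the guard is one line of validation), and `parse_extraction` is a hypothetical name:

```python
import json

ALIASES = {"k8s": "kubernetes", "eks": "aws"}
SENIORITIES = {"junior", "mid", "senior", "staff", "lead", "unspecified"}


def canonical_skill(name):
    """Lowercase, trim, then map known aliases to one canonical name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def parse_extraction(raw):
    """Return (seniority, [(skill, importance)]) or None if unusable.

    None means no rows are inserted, so the job is retried next run.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("skills"), list):
        return None
    seniority = data.get("seniority")
    if seniority not in SENIORITIES:  # catches the echoed "a|b|c" enum string
        seniority = "unspecified"
    skills = []
    for item in data["skills"]:
        if isinstance(item, dict) and isinstance(item.get("name"), str) and item["name"].strip():
            skills.append((canonical_skill(item["name"]),
                           item.get("importance", "unspecified")))
    return seniority, skills


echoed = parse_extraction(
    '{"seniority": "junior|mid|senior|staff|lead|unspecified",'
    ' "skills": [{"name": "K8s", "importance": "must"}]}'
)
```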

Supply: no artifact, no credit

The other side of the diff is a skills registry — markdown in my knowledge vault, with a machine-parseable YAML block. Every skill has a state, and the rule that keeps the whole thing honest is brutal: a skill counts as proven only if an artifact exists — a public repo, a blog post, documented production experience. Otherwise it’s claimed, and claimed earns half credit.

That rule immediately produced the most useful insight of the project: “invisible skill” is a real category. Python turned out to be the market’s #5 ask. I use it constantly — and could point to nothing public that shows it. The cheapest score increase isn’t learning something new; it’s a weekend making an existing skill visible. No gut-feeling gap analysis would have ranked “write about what you already do” above “learn the shiny thing.”

The score: distinct companies, not mentions

First naive aggregation: Canonical’s listings mention Ubuntu nine times, all marked must-have — suddenly Ubuntu looks like the hottest skill in Europe. Employer skew is the noise floor of small samples. The fix: demand weight = distinct companies naming the skill, not total mentions. One enthusiastic employer can’t move the radar.

Two more scoring rules I’d defend in review:

Skills named by fewer than two companies don’t count at all — single-listing noise stays out.
Demand the registry hasn’t classified yet shows up as “unreviewed” and counts fully against the score. An unreviewed market signal is a gap until proven otherwise; the dashboard nags me to triage it.
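A sketch of how these rules could combine into one number. The per-skill rules come from the text (distinct companies as weight, a two-company floor, half credit for claimed, nothing for unreviewed); the final weighted-average formula is my assumption, since the post does not spell out how the coverage score is aggregated:

```python
from collections import defaultdict

CREDIT = {"proven": 1.0, "claimed": 0.5}  # anything else earns nothing


def demand_weights(rows, min_companies=2):
    """Weight = number of distinct companies naming the skill."""
    companies = defaultdict(set)
    for company, skill in rows:
        companies[skill].add(company)
    return {skill: len(names) for skill, names in companies.items()
            if len(names) >= min_companies}


def coverage(rows, registry):
    """Assumed aggregate: credit-weighted share of total demand weight."""
    weights = demand_weights(rows)
    total = sum(weights.values())
    if total == 0:
        return 0.0
    earned = sum(weight * CREDIT.get(registry.get(skill), 0.0)
                 for skill, weight in weights.items())
    return earned / total


rows = [("canonical", "ubuntu")] * 9 + [
    ("a", "kubernetes"), ("b", "kubernetes"), ("c", "kubernetes"),
    ("a", "python"), ("b", "python"),
    ("a", "go"), ("c", "go"),
]
registry = {"kubernetes": "proven", "python": "claimed"}  # go is unreviewed
weights = demand_weights(rows)
score = coverage(rows, registry)
```

Nine Ubuntu mentions from one employer carry no weight at all here, while Go, named by two companies and absent from the registry, counts fully against the score.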

Rendering: the page is a git commit

The dashboard is a single static HTML file, and the pipeline that produces it never touches the cluster. render.js lives in this repo as the single source of truth; a nightly n8n workflow fetches it raw from GitLab, eval()s it against the Data Table rows and the registry, and — only if the result differs from what’s committed (timestamps stripped, or every night is a “change”) — PUTs the new index.html back via the GitLab API.
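The change check is the part that is easy to get wrong, so here is a sketch of it. The timestamp marker is hypothetical (an HTML comment of the form `<!-- generated: ... -->`); the actual render.js format is not shown in this post:

```python
import re

# Hypothetical marker: assumes the renderer wraps its timestamp like this.
STAMP_RE = re.compile(r"<!-- generated: .*? -->")


def strip_stamp(html):
    """Remove the generation timestamp so it cannot count as a change."""
    return STAMP_RE.sub("", html)


def needs_commit(new_html, committed_html):
    """Commit only when something other than the timestamp differs."""
    return strip_stamp(new_html) != strip_stamp(committed_html)


committed = "<html><!-- generated: 2026-03-26T02:00Z --><p>63%</p></html>"
same_data = "<html><!-- generated: 2026-03-27T02:00Z --><p>63%</p></html>"
new_data = "<html><!-- generated: 2026-03-27T02:00Z --><p>64%</p></html>"
```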

Serving is the same pattern as this blog: nginx plus a git-pull sidecar, deployed by Argo CD, behind the cluster’s OAuth middleware. The renderer has no kubeconfig, no SSH, no cluster access of any kind. GitLab stays the only source of truth — even for a page that rewrites itself nightly. If the workflow goes rogue, the worst it can do is a reviewable commit.

Day-one verdict

First run: 2,297 postings fetched, 25 in scope, 257 skill rows. Coverage score: 63%. Kubernetes and AWS tied at the top of demand — which means the AWS gap-closing project already in flight stopped being a hunch and became the measured top of the market. Go is the only top-ten demand with zero supply. The dashboard doesn’t get anyone a job; it just makes sure every learning Saturday is pointed where the data says, not where the hype does.

The job board rejected me. The data didn’t.

Workflows, render.js, and setup: github.com/janos-gyorgy/mind-the-gap.

🎯 Know the Market Without Job-Hunting: An LLM-Scored Job Poller in n8n

Fri, 13 Feb 2026 00:00:00 +0000

You don’t have to be about to change jobs to want to know the landscape. What’s being built, what it pays, where you’d actually fit — staying current on the market (and your own worth) is just good professional hygiene. The trouble is that checking is tedious, so most of us don’t, until we’re already job-hunting and starting cold.

So I automated mine. An n8n workflow on my homelab polls job boards every six hours, scores each new posting against my profile with an LLM, and emails me only the strong matches — the ones scoring 80%+. When it’s quiet, it’s silent. When something genuinely fits, I know the same day. Here’s what I learned building it. Repo at the bottom.

Three APIs cover most of the market

Company career pages look bespoke, but underneath, the vast majority run on one of three ATS — and all three hand you the jobs as unauthenticated JSON:

Greenhouse — boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true
Lever — api.lever.co/v0/postings/{token}?mode=json
Ashby — api.ashbyhq.com/posting-api/job-board/{token}?includeCompensation=true

No scraping, no headless browser. You poll the API the page itself calls, normalize the three shapes into one { company, title, location, remote, url, posted_at, description, external_id }, and you’re done with the hard part.
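A sketch of the normalizer, one small function per ATS. The source field names below (`absolute_url`, `hostedUrl`, `jobUrl` and so on) reflect my reading of typical payloads from these three APIs and should be treated as assumptions to verify against a live response; the target shape is the one listed above:

```python
def from_greenhouse(company, job):
    return {
        "company": company,
        "title": job.get("title", ""),
        "location": (job.get("location") or {}).get("name", ""),
        "remote": None,  # no remote flag assumed; left to the location filter
        "url": job.get("absolute_url", ""),
        "posted_at": job.get("updated_at"),
        "description": job.get("content", ""),
        "external_id": f"greenhouse:{company}:{job.get('id')}",
    }


def from_lever(company, job):
    return {
        "company": company,
        "title": job.get("text", ""),
        "location": (job.get("categories") or {}).get("location", ""),
        "remote": None,
        "url": job.get("hostedUrl", ""),
        "posted_at": job.get("createdAt"),
        "description": job.get("descriptionPlain", ""),
        "external_id": f"lever:{company}:{job.get('id')}",
    }


def from_ashby(company, job):
    return {
        "company": company,
        "title": job.get("title", ""),
        "location": job.get("location", ""),
        "remote": job.get("isRemote"),
        "url": job.get("jobUrl", ""),
        "posted_at": job.get("publishedAt"),
        "description": job.get("descriptionPlain", ""),
        "external_id": f"ashby:{company}:{job.get('id')}",
    }


NORMALIZERS = {"greenhouse": from_greenhouse, "lever": from_lever, "ashby": from_ashby}


def normalize(ats, company, job):
    """Dispatch on ATS name; an unknown ATS raises KeyError."""
    return NORMALIZERS[ats](company, job)


row = normalize("greenhouse", "example", {
    "id": 42, "title": "Senior Platform Engineer",
    "location": {"name": "Remote, EMEA"},
    "absolute_url": "https://example.com/jobs/42",
})
```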

“Resolve the token” is half the battle

The naive assumption — the token is the company name, and everyone’s on one of the three — is half right. When I probed my initial wishlist, roughly half 404’d everywhere: HashiCorp (now under IBM → Workday), SUSE (SuccessFactors), Aiven (Teamtailor), Hugging Face. They’re on a fourth or fifth system entirely. The honest move was to ship the ~33 that actually resolve and leave the rest as disabled config stubs. Verify before you trust a slug.

Dedup without a database

I didn’t want to stand up Postgres just to remember which jobs I’d already seen. n8n’s Data Tables handle it natively: a seen_jobs table, an external_id namespaced {ats}:{company}:{id}, and the rowNotExists operation drops anything already recorded. State lives inside n8n, backed up with it. Zero extra infrastructure.

The ordering matters: notify first, mark seen second. The insert only happens after the email sends, so a failed send retries next run instead of silently swallowing a posting.
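The ordering argument is easier to see in code. This sketch uses a plain Python set in place of the seen_jobs Data Table; `notify_new` and the sample job are hypothetical:

```python
def external_id(ats, company, job_id):
    """Namespaced id, so the same numeric id on two boards cannot collide."""
    return f"{ats}:{company}:{job_id}"


def notify_new(jobs, seen, send_email):
    """Email unseen jobs first, and mark them seen only if the send worked."""
    new_jobs = [job for job in jobs if job["external_id"] not in seen]
    if not new_jobs:
        return 0
    send_email(new_jobs)  # raises on failure, so nothing below runs
    seen.update(job["external_id"] for job in new_jobs)
    return len(new_jobs)


def failing_send(jobs):
    raise RuntimeError("mail server unavailable")


seen = set()
jobs = [{"external_id": external_id("greenhouse", "example", 42)}]

try:
    notify_new(jobs, seen, failing_send)
except RuntimeError:
    pass
after_failure = set(seen)  # the failed send marked nothing as seen

sent = []
first_run = notify_new(jobs, seen, sent.extend)
second_run = notify_new(jobs, seen, sent.extend)
```

Swapping the two lines inside `notify_new` would turn a mail outage into silently lost postings.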

The location filter is a trap

My first version kept everything that wasn’t explicitly US-based. The inbox filled with “Senior Platform Engineer — Spain (Remote)” and "… — United Kingdom (Remote)". Those aren’t remote-for-me — they’re remote if you live in Spain. Useless from where I sit.

The fix was to invert the logic. Keep only three things:

globally-remote / worldwide / anywhere,
pan-EU (EMEA / Europe / EU / EEA),
my own country.

…and drop single-country remote, even EU ones. Region and home matches win over the country deny-list, and ambiguous locations are kept (a missed match is worse than one extra line to skim). That one change cut the noise more than anything else.
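The precedence described above, as a sketch. The term lists are abbreviated, the home country is an arbitrary placeholder, and the deny-list is a three-entry sample; all of these are assumptions for illustration:

```python
import re

GLOBAL_TERMS = ("worldwide", "anywhere", "global")
REGION_WORDS = {"emea", "europe", "eu", "eea"}
DENY_COUNTRIES = ("spain", "united kingdom", "united states")  # sample only


def keep_location(location, home_country, deny=DENY_COUNTRIES):
    """Keep global, pan-EU and home matches; drop single-country remote."""
    text = location.lower()
    words = set(re.findall(r"[a-z]+", text))
    if any(term in text for term in GLOBAL_TERMS):
        return True
    if words & REGION_WORDS:  # whole words, so "eu" never matches inside a word
        return True
    if home_country.lower() in text:
        return True
    if any(country in text for country in deny):
        return False
    return True  # ambiguous: a missed match costs more than an extra line


HOME = "Austria"  # placeholder, not the author's actual country
```

Region and home checks run before the deny-list, so a listing like "Europe or Spain" is kept.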

Let an LLM read the actual job

Keyword + location filtering gets you a candidate list, but it can’t tell a “Platform Engineer” who herds Kubernetes from a “Platform Engineer” who owns a Figma design system. The job description can.

So the last step scores each new posting against my CV. My first version batched all of them into one big LLM call, which promptly timed out on the free tier. The fix was the opposite: one small call per job, which also means a single slow or rate-limited job never sinks the batch. Each call asks an NVIDIA NIM model (Llama 3.1 8B, OpenAI-compatible) for one number and a reason:

Score this job 0–100 for fit against my profile. Return {score, reason}.

That score is what lets me widen the net instead of narrowing it. On top of the curated company list I pull a broad remote-jobs feed (every company, all categories); the cheap keyword + location filters do the first pass, then I only email the roles scoring 80%+. Casting wide is fine when a model is the bar at the door. A line ends up looking like:

92% — Grafana Labs — Senior Platform Engineer (Remote, EMEA) — strong k8s/GitOps overlap — link

Scoring is fail-safe: if a call hiccups, that job is just skipped, and every posting gets marked seen either way — so nothing re-scores forever, and a rare bad run never floods or stalls the inbox.
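A sketch of the gate and its fail-safe behaviour. `score_job` stands in for the per-job model call and is injected so the example runs offline; the digest line uses `|` as a separator, and both helper names are hypothetical:

```python
import json

THRESHOLD = 80


def parse_score(raw):
    """Return (score, reason), or None if the reply is not usable."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return None
    if not 0 <= score <= 100:
        return None
    return score, str(data.get("reason", ""))


def digest_lines(jobs, score_job):
    """Score each job independently; a failed call skips only that job."""
    lines = []
    for job in jobs:
        try:
            parsed = parse_score(score_job(job))
        except Exception:
            parsed = None
        if parsed and parsed[0] >= THRESHOLD:
            score, reason = parsed
            lines.append(f"{score:.0f}% | {job['company']} | {job['title']}"
                         f" | {reason} | {job['url']}")
    return lines


def fake_scorer(job):
    if job["title"] == "Broken":
        raise TimeoutError("rate limited")
    return json.dumps({"score": job["fit"], "reason": "strong k8s/GitOps overlap"})


jobs = [
    {"company": "Grafana Labs", "title": "Senior Platform Engineer",
     "url": "https://example.com/1", "fit": 92},
    {"company": "Example", "title": "Broken", "url": "https://example.com/2", "fit": 99},
    {"company": "Example", "title": "Designer", "url": "https://example.com/3", "fit": 35},
]
lines = digest_lines(jobs, fake_scorer)
```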

The unglamorous bits that make it trustworthy

One bad source can’t kill the run — every fetch is wrapped; failures become a ⚠️ N sources failing footer so a company quietly changing ATS is visible, not invisible.
A prime run seeds the table silently the first time, so I’m not buried under every currently-open role on day one.
Everything tunable lives in one Config node — companies, keywords, location lists, the profile, the model — so adding a company is a one-line edit, not a graph safari.
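The first bullet, sketched: each source is fetched inside its own try/except, and failures are collected into a footer instead of aborting the run. The source names and the `fetch_all` helper are illustrative:

```python
def fetch_all(sources):
    """Run every fetcher; return (jobs, footer) without ever raising."""
    jobs, failed = [], []
    for name, fetch in sources.items():
        try:
            jobs.extend(fetch())
        except Exception:
            failed.append(name)  # one broken board must not sink the others
    footer = ""
    if failed:
        footer = f"⚠️ {len(failed)} sources failing: {', '.join(sorted(failed))}"
    return jobs, footer


def healthy():
    return [{"title": "Senior Platform Engineer"}]


def broken():
    raise ConnectionError("404: company moved to another ATS")


jobs, footer = fetch_all({"healthy-board": healthy, "broken-board": broken})
```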

Takeaways

The “scrape job boards” problem mostly isn’t a scraping problem — it’s three public APIs and a normalizer.
For personal automation, reach for the boring-but-correct primitive: native dedup state beats a database you have to operate.
An LLM works best here as the bar at the door: cheap deterministic filters keep the candidate set (and the cost) small, then the model gates on real fit — which is what lets you cast a wide net without drowning in it.

Workflow JSON, the full node-by-node breakdown, and setup notes: github.com/janos-gyorgy/ats-job-poller.

🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.

Fri, 07 Nov 2025 00:00:00 +0000

The question

I run n8n on my k3s homelab. Not docker-compose on a NUC — the full treatment: GitOps-reconciled, Vault-backed secrets, default-deny networking. The same boring platform everything else here runs on.

But “I have n8n running” proves nothing. I wanted to know if I actually understood it as an agent platform, and to answer a question I kept dodging: for agent work, do I need a cloud model, or is my local one good enough?

So I built a real agent and gave it two brains.

What I built

A chat assistant over brew-buddy, my homemade kombucha-tracking app (React + a small API + Postgres). You ask it things in plain language; it calls the app’s API and answers. The twist: the same question runs through two agents in parallel — one backed by NVIDIA’s hosted Llama-3.3-70B, one by a local Phi-3.5-mini on CPU — and the workflow prints both answers side by side.

Chat ──▶ Agent (cloud: NVIDIA 70B) ──┐   tools (shared):
     └─▶ Agent (local: Phi-3.5)   ──┤     • get_all_batches
                                    │     • get_batch_detail
                                    │     • brewing_statistics
            (Merge) ──▶ both replies, labeled     • add_batch_log   ⟵ write
                                                  • create_batch    ⟵ write

Both agents share the same read tools. The two write tools are wired to the cloud agent only — more on that below.

The nice part: I didn’t write a line of glue. n8n’s stock OpenAI Chat Model node talks to anything OpenAI-compatible if you override the credential’s Base URL — so one node points at https://integrate.api.nvidia.com/v1, the other at http://llama-server..svc:8080/v1 for the local server. Same node, two endpoints.

The infra that keeps it honest

I won’t re-explain the platform here — it’s in earlier posts: GitOps, Vault-backed secrets, default-deny networking, dual-path TLS ingress. But building the agent made one of them tangible.

n8n is, by design, a thing that makes arbitrary HTTP calls on a schedule. That’s exactly what you want behind a default-deny network policy. n8n couldn’t reach the brew-buddy API at all until I declared it — one line:

# n8n's namespace
allowEgressToNamespaces: [web-ai-engine, web-brew-buddy]
#                                          ^ added this for the agent

(plus a matching ingress-allow on brew-buddy’s side). That’s the posture working as intended: the blast radius of a workflow tool is whatever I’ve explicitly granted, and not one namespace more. Adding a capability is a reviewable one-liner in Git; Argo reconciles it. No kubectl, no guessing what n8n can reach.

The A/B: same agent, same tools, two brains

Plain “hi”. Cloud answers in ~0.5s. Local takes noticeably longer — because even for “hi”, the agent feeds the model the full system prompt plus the JSON schemas for every tool, and Phi-3.5 has to chew through all of it on CPU before it can say a word. So far, the boring expected result: local is slower.

Then I asked a real question, and the result flipped in a way I didn’t expect.

“What batches do I have?”

Cloud (70B) called get_all_batches, got the real rows, and answered:

You have two batches: 2026-04-09-A (cold-crash, 3L) and 2026-04-09-W (cold-crash, 3L).

Local (Phi-3.5) never called the tool. It didn’t seem to realise it had tools. Instead it confidently explained how I could go find the data myself:

To list all batches: 1. Access the brew-buddy app. 2. Look for a button labeled “List Batches”… def get_all_batches(): … … Remember, I’m unable to directly interact with apps or databases.

Fake instructions. Fake code. A polite apology. Everything except the actual answer it was sitting on top of.

Writing data. I asked both to log an observation. Cloud called add_batch_log and wrote a real row to Postgres (“I have recorded the observation…”). Local bluffed again — “here’s how you can log it yourself.”

Why it matters: capability, not latency

The interesting finding isn’t “the big model is better.” It’s how the small one fails.

With a ~3.8B model on CPU, the bottleneck for agent work isn’t speed — it’s capability. Phi-3.5 couldn’t reliably emit tool calls, so n8n’s tools never fired, and the model degraded into a chatbot that hallucinates a plausible answer instead of fetching the real one. That failure mode is worse than an error: an error you catch, a confident wrong answer you ship.
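One cheap guard against this failure mode is to check whether the model's reply contained a tool call at all before trusting a data answer. A sketch, assuming the OpenAI-compatible chat-completions response shape (`choices[0].message.tool_calls`); the helper names and sample responses are hypothetical:

```python
def tool_names(response):
    """List the tools requested in an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    return [call["function"]["name"] for call in message.get("tool_calls") or []]


def skipped_expected_tools(response, expected):
    """True when a data question was answered without any expected tool."""
    return not (set(tool_names(response)) & set(expected))


cloud_reply = {"choices": [{"message": {
    "content": None,
    "tool_calls": [{"type": "function",
                    "function": {"name": "get_all_batches", "arguments": "{}"}}],
}}]}
local_reply = {"choices": [{"message": {
    "content": "To list all batches: 1. Access the brew-buddy app...",
}}]}
```

A flag like this turns the confident wrong answer back into something closer to an error, which at least can be caught.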

A couple of measurements that sharpened it:

NVIDIA 70B, plain chat: ~0.5s.
NVIDIA 70B, function-calling (with tool schemas): ~8.6s per round-trip — and an agent makes several round-trips per answer. That’s real latency you have to budget a timeout for. (It’s also why the cloud side initially timed out in n8n until I raised the model node’s timeout — the model was fine, n8n was cutting it off.)

So the snappy-vs-slow comparison flips depending on whether the question triggers tools. Plain chat: cloud wins on speed. Tool use: the local model is “fast” only because it skips the tools and makes something up. Speed was never the real axis.

The honest caveat: this result is for one small general-purpose model in a multi-tool agent loop. Purpose-built small models with tool-calling fine-tunes do better at narrow tasks; I run a 1.7B one elsewhere that emits a single structured tool call just fine. But for “pick the right tool from several and chain them,” 70B was in a different league.

The trust boundary

I gave the write tools (add_batch_log, create_batch) to the cloud agent only. The local agent is read-only — not by instruction, by wiring. Even if Phi-3.5 did decide to call a write tool, the connection isn’t there. The reliable model is the only one allowed to mutate real data, and that’s enforced structurally, not by trusting a prompt.
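In n8n this boundary is drawn by which tool nodes are connected to which agent. The same idea in plain Python, as an analogy rather than the workflow itself: the allow-list is a property of the wiring, and the dispatcher refuses anything that is not wired, whatever the model asks for.

```python
READ_TOOLS = {"get_all_batches", "get_batch_detail", "brewing_statistics"}
WRITE_TOOLS = {"add_batch_log", "create_batch"}

# Wiring, not prompting: the local agent simply has no write tools.
AGENT_TOOLS = {
    "cloud": READ_TOOLS | WRITE_TOOLS,
    "local": READ_TOOLS,
}


def dispatch(agent, tool, handlers):
    """Run a tool only if it is wired to this agent."""
    if tool not in AGENT_TOOLS[agent]:
        raise PermissionError(f"{tool} is not wired to the {agent} agent")
    return handlers[tool]()


handlers = {name: (lambda name=name: f"ran {name}")
            for name in READ_TOOLS | WRITE_TOOLS}

cloud_write = dispatch("cloud", "add_batch_log", handlers)
try:
    dispatch("local", "add_batch_log", handlers)
    local_write_blocked = False
except PermissionError:
    local_write_blocked = True
```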

What’s toy and what’s real

Worth being straight: this is a single-node homelab. The agent and both model paths share one box. Running n8n on Kubernetes and swapping models isn’t novel — n8n’s own docs cover queue mode, where a main instance fans work out to a pool of worker pods you scale horizontally, with external Postgres for state. That’s the real production shape. Mine is one replica with an emptyDir’s worth of ambition.

What I think is worth sharing is the finding (the capability cliff, and that its failure mode is confident fabrication) and the boring thing underneath it: because the platform is default-deny and GitOps-reconciled, running this experiment cost me one reviewable egress line and zero risk to anything else.

The boring part is the point

The AI was the fun bit. But the reason I could bolt an agent onto a live cluster, point it at a real app, give it write access to one model and not the other, and tear it all down again — without worrying what it might touch — is that the infrastructure was already boring. Default-deny. Secrets out of Git. git push, Argo reconciles.

The model picks the tools. The platform decides what the tools can reach. Keep those two honest about each other and self-hosting an agent stops being scary and starts being just another app.

🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

Fri, 26 Sep 2025 00:00:00 +0000

The problem with cloud LLM access

Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.

The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone’s personal data into the chat? It leaves the cluster. It’s logged somewhere you don’t control.

The standard answer is “train your users.” I’d rather have a technical control.

The architecture

Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain

Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:

SAFE → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)
SENSITIVE: → return HTTP 400, log the block, nothing leaves the cluster

The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.

The implementation

The proxy is ~150 lines of FastAPI. The classifier call:

import httpx

# LLAMA_BASE (the llama-server base URL) is defined elsewhere in the service
CLASSIFIER_PROMPT = """You are a data security classifier. Check if the text below contains sensitive information:
passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.

Reply with ONLY one of:
SAFE
SENSITIVE: 

Text to check:
"""

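async def classify(text: str) -> tuple[bool, str]:
    """Ask the local model whether text is sensitive; returns (is_sensitive, reason)."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            f"{LLAMA_BASE}/chat/completions",
            json={
                "model": "phi-3.5-mini",
                # Only the first 3000 characters are sent to the classifier
                "messages": [{"role": "user", "content": CLASSIFIER_PROMPT + text[:3000]}],
                "max_tokens": 30,
                "temperature": 0,
                "stream": False,
            },
            headers={"Authorization": "Bearer sk-no-key"},
        )
    # Raise on HTTP errors so the caller's error handling sees them directly
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    if answer.upper().startswith("SENSITIVE"):
        # Everything after the first colon is treated as the model's reason
        reason = answer.split(":", 1)[1].strip() if ":" in answer else "sensitive content detected"
        return True, reason
    return False, ""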

temperature=0 and max_tokens=30 keep the response deterministic and fast. The model only needs to output one word or one line.

The main handler:

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    body = await request.json()
    user_text = extract_user_text(body.get("messages", []))

    if user_text.strip():
        try:
            is_sensitive, reason = await classify(user_text)
        except Exception as exc:
            log.error("classifier error: %s — allowing request through", exc)
            is_sensitive = False

        if is_sensitive:
            return JSONResponse(status_code=400, content={
                "error": {
                    "message": f"Request blocked by ai-guard: {reason}. Remove sensitive content before sending to external models.",
                    "type": "content_policy_violation",
                }
            })

    # Safe — forward to upstream with streaming support
    ...

Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I’d rather not block all cloud LLM access because the local model is warming up.
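The handler calls extract_user_text without showing it. A plausible sketch, assuming it joins every user message and accepts both plain-string content and OpenAI-style lists of content parts; the real helper may well look only at the last message. The `FAIL_OPEN` switch is hypothetical and shows how the fail-open choice could become a setting:

```python
FAIL_OPEN = True  # hypothetical switch; False would block on classifier errors


def extract_user_text(messages):
    """Join the text of all user messages in a chat-completions request."""
    chunks = []
    for message in messages:
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            chunks.append(content)
        elif isinstance(content, list):  # multimodal-style content parts
            chunks.extend(part.get("text", "") for part in content
                          if isinstance(part, dict) and part.get("type") == "text")
    return "\n".join(chunks)


def sensitive_on_error(fail_open=FAIL_OPEN):
    """Decision to use when the classifier itself failed."""
    return not fail_open


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "my password is hunter2"},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": [{"type": "text", "text": "and my email too"}]},
]
user_text = extract_user_text(messages)
```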

Kubernetes deployment

ai-guard runs in the same namespace as llama-server and Open WebUI (web-ai-engine). In this setup, intra-namespace traffic is always allowed under Cilium, so no new network policy is needed.

Open WebUI uses semicolon-separated lists for multiple API backends:

- name: OPENAI_API_BASE_URLS
  value: "http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1"
- name: OPENAI_API_KEYS
  value: "sk-no-key;sk-no-key"

The second entry is ai-guard. Open WebUI passes sk-no-key as the API key — ai-guard ignores it and uses its own UPSTREAM_API_KEY from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.

The latency tradeoff

The classification step adds 5–15 seconds on CPU inference. That’s the cost of keeping the check fully private — the classifier never sends data anywhere.

For a personal homelab assistant, this is fine. For a high-throughput production setup, you’d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.

What it catches

The classifier prompt targets:

Passwords, API keys, tokens, credentials
PII: names, emails, phone numbers, SSNs, addresses
Financial data: card numbers, bank accounts
Private keys

False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.

Source

github.com/janos-gyorgy/ai-guard — MIT licensed, Kubernetes manifests included.

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Fri, 12 Sep 2025 00:00:00 +0000

The problem with blocking

The PII guardrail proxy I built last week works by classifying prompts and blocking the sensitive ones. That’s fine for a chat interface where a human can rephrase. It doesn’t work for automated pipelines.

If a Jira ticket contains someone’s name and an internal hostname, you don’t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.

The pattern

Input text
  → anonymizer: extract PII, replace with semantic fakes
  → "Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net"
  + mapping: {"Nathan Chen" → "John Smith", "DataSoft LLC" → "ACME", ...}
  → cloud LLM: processes coherent text, never sees real values
  → "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
  → string substitution with reverse mapping
  → "John Smith should check the OAuth docs with the ACME team"

Two things that make this work:

Deanonymization needs no LLM. Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.

Semantic fakes beat placeholder tokens. An earlier version of this used [PERSON_1], [ORG_1] tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees Nathan Chen from DataSoft LLC, it treats it as real text and responds naturally. Quality is noticeably better.

Prior art — what already exists

This is a well-established pattern. Worth knowing what’s out there:

LLM Guard (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you’re building this for anything serious.

Microsoft PII Shield — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.

anonLLM — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.

REDACT — IEEE paper describing a system using Ollama for PII redaction in documents.

HuggingFace Anonymizer SLM series — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1’s 9.77.

That last one is what this implementation actually uses.

The model: Anonymizer-1.7B

eternisai/Anonymizer-1.7B is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:

{
  "name": "replace_entities",
  "arguments": {
    "replacements": [
      {"original": "John Smith", "replacement": "Nathan Chen"},
      {"original": "ACME Corp", "replacement": "DataSoft LLC"},
      {"original": "auth.acme.internal", "replacement": "dev.internal.net"}
    ]
  }
}

No prompt engineering needed. The model knows exactly what it’s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.

The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: hf.co/gabriellarson/Anonymizer-1.7B-GGUF.

The implementation

llm-anonymizer is a FastAPI service with two endpoints.

POST /anonymize — calls Ollama with the tool definition, parses the response:

TOOLS = [{
    "type": "function",
    "function": {
        "name": "replace_entities",
        "description": "Replace PII entities with anonymized versions",
        "parameters": {
            "type": "object",
            "properties": {
                "replacements": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "original": {"type": "string"},
                            "replacement": {"type": "string"},
                        },
                        "required": ["original", "replacement"],
                    },
                }
            },
            "required": ["replacements"],
        },
    },
}]

resp = await client.post(f"{OLLAMA_BASE}/api/chat", json={
    "model": MODEL,
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text + "\n/no_think"},  # skip Qwen3 thinking mode
    ],
    "tools": TOOLS,
    "stream": False,
})

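resp.raise_for_status()
# The model may answer in plain text instead of calling the tool
tool_calls = resp.json()["message"].get("tool_calls") or []
replacements = tool_calls[0]["function"]["arguments"]["replacements"] if tool_calls else []

# Build reverse mapping: replacement → original (for deanonymization)
anonymized = text
mapping = {}
# Longest originals first, so "John Smith" is replaced before a bare "John"
for pair in sorted(replacements, key=lambda p: len(p["original"]), reverse=True):
    anonymized = anonymized.replace(pair["original"], pair["replacement"])
    mapping[pair["replacement"]] = pair["original"]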

The /no_think suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.

POST /deanonymize — no model call, just substitution:

for replacement, original in sorted(mapping.items(), key=lambda x: len(x[0]), reverse=True):
    text = text.replace(replacement, original)

Sorted by length descending so longer tokens don’t get partially overwritten by shorter ones.
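To see why the ordering matters, here is a minimal, self-contained sketch of the same substitution. The function name and the sample mapping are illustrative assumptions, not taken from the service.

```python
def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore originals, replacing the longest fake values first."""
    ordered = sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)
    for fake, original in ordered:
        text = text.replace(fake, original)
    return text


# Hypothetical mapping where one fake value is a prefix of another.
mapping = {"Nathan": "John", "Nathan Chen": "John Smith"}
restored = deanonymize("Nathan Chen emailed Nathan.", mapping)
print(restored)
```

Applied in insertion order instead, "Nathan" would be replaced first and the full name would come back as "John Chen".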

The Kubernetes stack

Ollama runs as a separate deployment in the same namespace as everything else (web-ai-engine). Intra-namespace traffic is already allowed by this cluster’s policies, so no new network policies are needed.

llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF

One-time model pull after first deploy:

kubectl exec -n web-ai-engine deploy/ollama -- \
  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF

Ollama caches it on a 10Gi PVC, so pod restarts don’t re-download.

The n8n pipeline

Five-node chain triggered by webhook:

Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond

The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.

Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.
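For readers who don’t use n8n, the same chain can be sketched as a plain function. This is only an illustration of the data flow: the three callables are hypothetical stand-ins for the HTTP calls to /anonymize, NVIDIA NIM, and /deanonymize, and none of the names come from the actual workflow.

```python
from typing import Callable


def run_pipeline(
    text: str,
    anonymize: Callable[[str], tuple[str, dict[str, str]]],
    ask_cloud: Callable[[str], str],
    deanonymize: Callable[[str, dict[str, str]], str],
) -> str:
    """Anonymize locally, send only the fake text out, restore on return."""
    anonymized, mapping = anonymize(text)
    answer = ask_cloud(anonymized)
    return deanonymize(answer, mapping)


# Stand-ins for the real services, for illustration only.
def fake_anonymize(text):
    return text.replace("John Smith", "Nathan Chen"), {"Nathan Chen": "John Smith"}


sent_to_cloud = []


def fake_cloud(text):
    sent_to_cloud.append(text)  # record what actually leaves the cluster
    return f"Summary: {text}"


def fake_deanonymize(text, mapping):
    for fake, original in mapping.items():
        text = text.replace(fake, original)
    return text


result = run_pipeline("John Smith reset the VPN.", fake_anonymize, fake_cloud, fake_deanonymize)
print(result)
print(sent_to_cloud)
```

The property worth checking in any real version is the second print: the original name should never appear in what was sent to the cloud.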

The caveats

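1.7B isn’t GPT-4.1. The model scores 9.20/10 on the benchmark against GPT-4.1’s 9.77. That is an average quality score from a judge model, not an error rate, so it says little about how often an entity will be missed or replaced incorrectly in your own data. Test with real examples from your domain before depending on it.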

Deanonymization breaks on heavy rephrasing. If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn’t eliminate the risk.

Ollama adds a deployment. That’s a ~500MB image plus the model weights (~1GB at Q4). On a constrained single-node cluster that’s real overhead. llama-server already covers general chat; Ollama is here purely for this model’s tool-calling support.
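One cheap guard against the rephrasing caveat is to check which fake values never show up verbatim in the cloud response. The sketch below is an assumption about how such a check could look, not part of the service; the function name and sample data are hypothetical. A fake value that is absent may simply not have been mentioned, so treat the result as a prompt to review rather than proof of a miss.

```python
def unseen_fakes(response: str, mapping: dict[str, str]) -> list[str]:
    """Return fake values that never appear verbatim in the cloud response."""
    return [fake for fake in mapping if fake not in response]


# Hypothetical case: the cloud model shortened "Nathan Chen" to "Mr. Chen".
mapping = {"Nathan Chen": "John Smith", "DataSoft LLC": "ACME Corp"}
response = "Mr. Chen should rotate the DataSoft LLC credentials."
suspects = unseen_fakes(response, mapping)
print(suspects)
```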

Source

github.com/janos-gyorgy/llm-anonymizer — MIT licensed, Kubernetes manifests and n8n workflow included.

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

Fri, 29 Aug 2025 00:00:00 +0000

What “operating an LLM” actually means

Running a local model is easy. Understanding what it’s doing is less so.

After deploying llama.cpp + Open WebUI on k3s (previous post), I had a chat interface backed by a local model. What I didn’t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.

The instinct for this kind of problem is usually “add a proxy layer.” There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.

The thing I’d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.

`--metrics`

One additional argument to the inference server:

args:
  - -m
  - /models/Phi-3.5-mini-instruct-Q4_K_M.gguf
  - --host
  - "0.0.0.0"
  - --port
  - "8080"
  - --ctx-size
  - "4096"
  - --n-predict
  - "1024"
  - --parallel
  - "1"
  - --metrics        # ← this
  - --log-disable

After restart, GET /metrics on port 8080 returns valid Prometheus exposition format:

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0

The full set of metrics:

Metric	Type	What it measures
`llamacpp:prompt_tokens_total`	counter	Input tokens processed (cumulative)
`llamacpp:tokens_predicted_total`	counter	Output tokens generated (cumulative)
`llamacpp:prompt_tokens_seconds`	gauge	Current prompt throughput (tok/s)
`llamacpp:predicted_tokens_seconds`	gauge	Current generation throughput (tok/s)
`llamacpp:tokens_predicted_seconds_total`	counter	Total time spent generating
`llamacpp:prompt_seconds_total`	counter	Total time spent on prompts
`llamacpp:requests_processing`	gauge	Requests currently being processed
`llamacpp:requests_deferred`	gauge	Requests queued, waiting for a slot
`llamacpp:n_decode_total`	counter	Total llama_decode() calls
`llamacpp:n_busy_slots_per_decode`	counter	Slots active per decode call

These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.
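For a quick look without Prometheus, the exposition format is simple enough to read by hand. The parser below is a minimal sketch that handles only unlabelled samples like the ones shown above; it is not a full implementation of the format, and the function name is my own.

```python
def parse_metrics(body: str) -> dict[str, float]:
    """Parse unlabelled Prometheus samples ("name value") into a dict."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, value = line.split()[:2]  # ignore an optional trailing timestamp
        samples[name] = float(value)
    return samples


sample = """\
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
"""
metrics = parse_metrics(sample)
print(metrics)
```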

Prometheus scrape config

Adding a static scrape target in the existing Prometheus configuration:

extraScrapeConfigs: |
  - job_name: llama-server
    static_configs:
      - targets:
          - llama-server.web-ai-engine.svc:8080
    metrics_path: /metrics

The only non-obvious thing here is the network policy: Prometheus lives in dashboard-homelab, and llama-server lives in web-ai-engine. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In applications.yml:

- namespace: web-ai-engine
  networkPolicies:
    allowIngressFromNamespaces: [dashboard-homelab]

Without this, Prometheus scrape attempts fail silently with a timeout.

Grafana dashboard via ConfigMap

Rather than importing a dashboard JSON manually through the Grafana UI, I let the Grafana sidecar handle it automatically. Any ConfigMap with the label grafana_dashboard: "1" is picked up, loaded, and made available in Grafana. With the sidecar’s searchNamespace set to ALL (the kube-prometheus-stack default), that works across all namespaces.

The dashboard ConfigMap lives in web-ai-engine, not dashboard-homelab. The sidecar finds it regardless:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-llm
  namespace: web-ai-engine
  labels:
    grafana_dashboard: "1"
data:
  llm-metrics.json: |
    {
      "title": "LLM Metrics",
      "uid": "llm-metrics",
      ...
    }

Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.

This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app’s Kubernetes resources also describes what the monitoring looks like.

What the dashboard shows

After sending a few messages through Open WebUI:

Generation throughput: the llamacpp:predicted_tokens_seconds gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you’re comparing models or quantisation levels.

Cumulative tokens: llamacpp:prompt_tokens_total and llamacpp:tokens_predicted_total both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it’s typically 3:1 prompt to generation; for summarisation tasks it skews even further toward prompt tokens, since the input is long and the output short.

Queue depth: llamacpp:requests_deferred is 0 almost always, which is expected for a single user, even with --parallel 1. If it’s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.

ms/token: derived from rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000. This is the per-token latency, the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.
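The ms/token arithmetic is easy to sanity-check by hand. The sketch below mirrors the PromQL expression using two counter deltas, and converts a throughput reading into the equivalent latency. The delta values are made up for illustration.

```python
def ms_per_token(seconds_delta: float, tokens_delta: float) -> float:
    """Average generation latency per token over an interval, in milliseconds."""
    if tokens_delta <= 0:
        return 0.0  # nothing was generated in the window
    return seconds_delta / tokens_delta * 1000


def tokens_per_second_to_ms(tok_per_s: float) -> float:
    """Convert a throughput reading into the equivalent per-token latency."""
    return 1000 / tok_per_s


# Hypothetical deltas between two scrapes: 60 s of generation, 240 tokens.
print(ms_per_token(60.0, 240))
# The 3-5 tok/s range seen on the N100, expressed as latency.
print(round(tokens_per_second_to_ms(3)), round(tokens_per_second_to_ms(5)))
```

So the 3–5 tok/s measured on this hardware corresponds to roughly 200–333 ms per token.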

What’s missing compared to a proxy layer

LiteLLM and similar proxies give you things this setup doesn’t:

Per-model routing — if you’re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.
Virtual API keys — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.
Spend tracking — meaningful when you’re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.

For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.

The pattern

The broader point is that the observable unit here isn’t the proxy — it’s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it’s the right place to measure.

Starter manifests with the metrics configuration included: homelab-ai-inference-starter

🤖 Local LLM Inference on Kubernetes, No GPU Required

Fri, 15 Aug 2025 00:00:00 +0000

The GPU assumption

Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you’re wasting your time — inference will be too slow to be useful.

That’s not been my experience.

I’ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.

The point isn’t that CPU inference beats GPU inference. It’s that “good enough for personal use” is a much lower bar than “production LLM serving”, and the hardware you already have probably clears it.

The stack

Two components:

llama.cpp (ghcr.io/ggml-org/llama.cpp:server) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.

Open WebUI (ghcr.io/open-webui/open-webui) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.

The architecture is simple on purpose:

Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models

Open WebUI doesn’t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn’t change. The interface is the standard.
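Because the contract is the OpenAI-style chat completions route, any client can talk to the server the same way Open WebUI does. The sketch below only builds the request so it runs anywhere. The in-cluster hostname is an assumption based on the namespace and port used in this setup, and the "local" model name is a placeholder of mine.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, model: str = "local") -> urllib.request.Request:
    """Build a POST to the OpenAI-style chat completions route."""
    payload = {
        "model": model,  # placeholder name, an assumption for this sketch
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed in-cluster address of the llama-server Service.
req = build_chat_request("http://llama-server.web-ai-engine.svc:8080", "Explain hostPath volumes.")
print(req.full_url)

# To actually send it from inside the cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping the backend for another OpenAI-compatible server should only change the base URL passed in.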

Model choice

GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:

Model	Quant	Size	RAM at runtime	Notes
Llama-3.2-3B	Q4_K_M	~2 GB	~3 GB	Fastest, lowest quality
Phi-3.5-mini	Q4_K_M	~2.4 GB	~3–4 GB	Good balance — what I use
Mistral-7B-Instruct	Q4_K_M	~4.1 GB	~5–6 GB	Noticeably better, needs more RAM
Llama-3.1-8B	Q4_K_M	~4.7 GB	~6–8 GB	High quality, stretches 16 GB with other workloads

On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it’s tighter.

Models live in /srv/ai-models on the node, mounted into the pod as a hostPath volume. Single-node homelab, so there’s no scheduling concern. Download once with wget, done.

Key configuration choices

Context size (--ctx-size 4096): How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you’re summarising long documents, bump to 8192 and watch your RAM usage.

Max output tokens (--n-predict 1024): Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.

Parallel slots (--parallel 1): How many concurrent inference requests the server handles. On CPU there’s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.

Memory limits: Set the container limit to roughly 2× the model’s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 6Gi

No CPU limit. llama-server will use however many cores are available during inference — that’s what makes it usable. A CPU limit would throttle inference to unusable speeds.
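The link between context size and RAM can be estimated with a back-of-the-envelope KV-cache calculation. This is a rough sketch, not a measurement: it assumes a 16-bit cache and an architecture of 32 layers, 32 KV heads, and a head dimension of 96 for Phi-3.5-mini. Those numbers are my assumptions, so check the model card before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one full context window."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3
# Assumed architecture: 32 layers, 32 KV heads, head dimension 96.
for ctx in (4096, 8192):
    size = kv_cache_bytes(n_layers=32, n_ctx=ctx, n_kv_heads=32, head_dim=96)
    print(ctx, size / GIB)
```

Under these assumptions the cache is about 1.5 GiB at 4096 tokens and doubles at 8192, which is why a larger context is the first thing to watch against the memory limit.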

Deployment as a GitOps push

The whole stack lives in one YAML values file, deployed through the extra-objects chart that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.

Nothing was kubectl apply-ed. The deployment happened by pushing to Git.

What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no kubectl. The same process I use for any other service in the cluster.

The namespace, network policies, service account, and RBAC all generate from a single entry in applications.yml — same as every other app. The AI inference stack isn’t special from an operations perspective.

# applications.yml excerpt
- namespace: web-ai-engine
  applications:
    - applicationCode: web-ai-engine
      path: helm-charts/extra-objects
      autoSync: true

Access and auth

The service is exposed at ai.hippotion.com through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let’s Encrypt cert. The post on that setup has the full explanation.

Auth is handled by Traefik’s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI’s own auth is disabled (WEBUI_AUTH: false) — the OAuth layer upstream handles it. One login covers every service in the cluster.

The WEBUI_SECRET_KEY (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.

What the day-to-day is actually like

Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you’re reading what came before while it generates, that’s fine. For quick factual lookups, it’s a little tedious.

The useful cases for a local model, for me:

Rephrasing or editing text — paste something, ask it to tighten it. No data leaves the house.
Config explanation — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.
Quick summaries — short documents, log snippets, error messages.
Experimentation — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.

For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I’m iterating and don’t want to pay per token.

The starting point if you want to try it

The manifests are on GitHub: homelab-ai-inference-starter

It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.

No GPU required. The ThinkCentre in the corner of my desk does the job.