FastAPI on hippotion

🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

Fri, 26 Sep 2025 00:00:00 +0000

The problem with cloud LLM access

Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.

The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone’s personal data into the chat? It leaves the cluster. It’s logged somewhere you don’t control.

The standard answer is “train your users.” I’d rather have a technical control.

The architecture

Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain

Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:

SAFE → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)
SENSITIVE: <reason> → return HTTP 400, log the block, nothing leaves the cluster

The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.
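The handler further down calls extract_user_text, which the post does not show. Here is a minimal sketch of what such a helper might look like, assuming OpenAI-style messages whose content is either a plain string or a list of typed parts. Checking every user message rather than only the latest one is also an assumption, made because the whole history gets forwarded upstream.

```python
from typing import Any


def extract_user_text(messages: list[dict[str, Any]]) -> str:
    """Join the text of every user message (hypothetical helper)."""
    chunks: list[str] = []
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        if isinstance(content, str):
            chunks.append(content)
        elif isinstance(content, list):
            # Multi-part content: keep only the text parts, skip images etc.
            for part in content:
                if isinstance(part, dict) and part.get("type") == "text":
                    chunks.append(str(part.get("text", "")))
    return "\n".join(chunks)
```

System and assistant messages are skipped here; whether they should also be classified depends on how much you trust what ends up in them.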

The implementation

The proxy is ~150 lines of FastAPI. The classifier call:

CLASSIFIER_PROMPT = """You are a data security classifier. Check if the text below contains sensitive information:
passwords, API keys, tokens, credentials, personally identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.

Reply with ONLY one of:
SAFE
SENSITIVE: <reason>

Text to check:
"""

async def classify(text: str) -> tuple[bool, str]:
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            f"{LLAMA_BASE}/chat/completions",
            json={
                "model": "phi-3.5-mini",
                "messages": [{"role": "user", "content": CLASSIFIER_PROMPT + text[:3000]}],
                "max_tokens": 30,
                "temperature": 0,
                "stream": False,
            },
            headers={"Authorization": "Bearer sk-no-key"},
        )
    resp.raise_for_status()  # surface llama-server HTTP errors instead of a KeyError below
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    if answer.upper().startswith("SENSITIVE"):
        reason = answer.split(":", 1)[1].strip() if ":" in answer else "sensitive content detected"
        return True, reason
    return False, ""

temperature=0 makes the output effectively deterministic, and max_tokens=30 keeps it fast. The model only needs to output one word or one line.

The main handler:

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    body = await request.json()
    user_text = extract_user_text(body.get("messages", []))

    if user_text.strip():
        try:
            is_sensitive, reason = await classify(user_text)
        except Exception as exc:
            log.error("classifier error: %s — allowing request through", exc)
            is_sensitive = False

        if is_sensitive:
            return JSONResponse(status_code=400, content={
                "error": {
                    "message": f"Request blocked by ai-guard: {reason}. Remove sensitive content before sending to external models.",
                    "type": "content_policy_violation",
                }
            })

    # Safe — forward to upstream with streaming support
    ...

Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I’d rather not block all cloud LLM access because the local model is warming up.
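If you want to switch between the two behaviours without editing the handler, the decision can be pulled into a small function. This is a sketch under assumptions: the AI_GUARD_FAIL_CLOSED variable and the should_block name are hypothetical and not part of ai-guard.

```python
import os
from typing import Optional


def should_block(verdict: Optional[bool], fail_closed: bool) -> bool:
    """Decide whether to block a request.

    verdict is True/False from the classifier, or None when the classifier
    call itself failed (timeout, llama-server down).
    """
    if verdict is None:
        return fail_closed
    return verdict


# Hypothetical env var; the default keeps the fail-open behaviour described above.
FAIL_CLOSED = os.environ.get("AI_GUARD_FAIL_CLOSED", "false").lower() in ("1", "true", "yes")
```

In the handler, the except branch would set the verdict to None and the block decision would become should_block(verdict, FAIL_CLOSED).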

Kubernetes deployment

ai-guard runs in the same namespace as llama-server and Open WebUI (web-ai-engine). Intra-namespace traffic is already allowed in this cluster’s Cilium setup, so no new network policy is needed.

Open WebUI uses semicolon-separated lists for multiple API backends:

- name: OPENAI_API_BASE_URLS
  value: "http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1"
- name: OPENAI_API_KEYS
  value: "sk-no-key;sk-no-key"

The second entry is ai-guard. Open WebUI passes sk-no-key as the API key — ai-guard ignores it and uses its own UPSTREAM_API_KEY from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.
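The two variables are matched by position, so a missing or extra entry silently shifts keys onto the wrong backend. The sketch below only illustrates that pairing; it is not Open WebUI’s code, and pair_backends is a made-up name.

```python
def pair_backends(urls: str, keys: str) -> list[tuple[str, str]]:
    """Pair semicolon-separated base URLs with keys by position."""
    url_list = [u.strip() for u in urls.split(";")]
    key_list = [k.strip() for k in keys.split(";")]
    if len(url_list) != len(key_list):
        raise ValueError(
            f"{len(url_list)} base URLs but {len(key_list)} API keys"
        )
    return list(zip(url_list, key_list))
```

Running a check like this against the manifest values in CI is one way to catch a list that has drifted out of sync.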

The latency tradeoff

The classification step adds 5–15 seconds on CPU inference. That’s the cost of keeping the check fully private — the classifier never sends data anywhere.

For a personal homelab assistant, this is fine. For a high-throughput production setup, you’d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.

What it catches

The classifier prompt targets:

Passwords, API keys, tokens, credentials
PII: names, emails, phone numbers, SSNs, addresses
Financial data: card numbers, bank accounts
Private keys

False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.

Source

github.com/janos-gyorgy/ai-guard — MIT licensed, Kubernetes manifests included.

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Fri, 12 Sep 2025 00:00:00 +0000

The problem with blocking

The PII guardrail proxy I built last week works by classifying prompts and blocking the sensitive ones. That’s fine for a chat interface where a human can rephrase. It doesn’t work for automated pipelines.

If a Jira ticket contains someone’s name and an internal hostname, you don’t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.

The pattern

Input text
  → anonymizer: extract PII, replace with semantic fakes
  → "Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net"
  + reverse mapping (fake → real): {"Nathan Chen" → "John Smith", "DataSoft LLC" → "ACME", ...}
  → cloud LLM: processes coherent text, never sees real values
  → "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
  → string substitution with reverse mapping
  → "John Smith should check the OAuth docs with the ACME team"

Two things that make this work:

Deanonymization needs no LLM. Once you have the mapping, restoring is pure string substitution. The local anonymizer model is only called on the way in.

Semantic fakes beat placeholder tokens. An earlier version of this used [PERSON_1], [ORG_1] tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees Nathan Chen from DataSoft LLC, it treats it as real text and responds naturally. Quality is noticeably better.
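The round trip in the pattern above can be sketched with plain string substitution. The mapping is hard-coded here in place of the model call, and the original sentence is a plausible reconstruction from that mapping, since the post only shows the anonymized text.

```python
def substitute(text: str, table: dict[str, str]) -> str:
    """Replace every key of table with its value, longest keys first."""
    for src in sorted(table, key=len, reverse=True):
        text = text.replace(src, table[src])
    return text


# real -> fake, as an anonymizer might produce it (values from the example above)
forward = {
    "John Smith": "Nathan Chen",
    "ACME": "DataSoft LLC",
    "OAuth": "ProjectX",
    "auth.acme.internal": "dev.internal.net",
}
# fake -> real, used on the way back
reverse = {fake: real for real, fake in forward.items()}

# Assumed original input; not shown in the post.
original = "John Smith from ACME needs OAuth fixed on auth.acme.internal"
sent = substitute(original, forward)

# The cloud model only ever sees and returns fake values.
reply = "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
restored = substitute(reply, reverse)
```

Here sent is the anonymized sentence from the diagram and restored is its last line; no model is involved in either substitution.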

Prior art — what already exists

This is a well-established pattern. Worth knowing what’s out there:

LLM Guard (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you’re building this for anything serious.

Microsoft PII Shield — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.

anonLLM — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.

REDACT — IEEE paper describing a system using Ollama for PII redaction in documents.

HuggingFace Anonymizer SLM series — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1’s 9.77.

That last one is what this implementation actually uses.

The model: Anonymizer-1.7B

eternisai/Anonymizer-1.7B is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:

{
  "name": "replace_entities",
  "arguments": {
    "replacements": [
      {"original": "John Smith", "replacement": "Nathan Chen"},
      {"original": "ACME Corp", "replacement": "DataSoft LLC"},
      {"original": "auth.acme.internal", "replacement": "dev.internal.net"}
    ]
  }
}

No prompt engineering needed. The model knows exactly what it’s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.

The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: hf.co/gabriellarson/Anonymizer-1.7B-GGUF.

The implementation

llm-anonymizer is a FastAPI service with two endpoints.

POST /anonymize — calls Ollama with the tool definition, parses the response:

TOOLS = [{
    "type": "function",
    "function": {
        "name": "replace_entities",
        "description": "Replace PII entities with anonymized versions",
        "parameters": {
            "type": "object",
            "properties": {
                "replacements": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "original": {"type": "string"},
                            "replacement": {"type": "string"},
                        },
                        "required": ["original", "replacement"],
                    },
                }
            },
            "required": ["replacements"],
        },
    },
}]

resp = await client.post(f"{OLLAMA_BASE}/api/chat", json={
    "model": MODEL,
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text + "\n/no_think"},  # skip Qwen3 thinking mode
    ],
    "tools": TOOLS,
    "stream": False,
})

tool_calls = resp.json()["message"]["tool_calls"]
replacements = tool_calls[0]["function"]["arguments"]["replacements"]

# Build reverse mapping: replacement → original (for deanonymization)
anonymized = text
mapping = {}
for pair in replacements:
    anonymized = anonymized.replace(pair["original"], pair["replacement"])
    mapping[pair["replacement"]] = pair["original"]

The /no_think suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.
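The snippet above indexes tool_calls[0] directly, which raises if the model answers in plain text, and it applies replacements in whatever order the model lists them. A more defensive variant might look like the sketch below. It assumes the Ollama message shape used above (arguments already parsed into a dict), and apply_replacements is a hypothetical name.

```python
from typing import Any


def apply_replacements(text: str, message: dict[str, Any]) -> tuple[str, dict[str, str]]:
    """Apply the model's replacements; return (anonymized_text, reverse_mapping)."""
    pairs: list[dict[str, str]] = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") == "replace_entities":
            pairs.extend(fn.get("arguments", {}).get("replacements", []))

    mapping: dict[str, str] = {}
    # Longest originals first, so "ACME Corp" is replaced before "ACME".
    for pair in sorted(pairs, key=lambda p: len(p["original"]), reverse=True):
        original, fake = pair["original"], pair["replacement"]
        if not original or original not in text:
            continue  # skip empty entries and entities that are not in the text
        text = text.replace(original, fake)
        mapping[fake] = original
    return text, mapping
```

With no tool call in the response this returns the text unchanged and an empty mapping. For a privacy pipeline you may prefer to treat that case as an error rather than send the raw text on.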

POST /deanonymize — no model call, just substitution:

for replacement, original in sorted(mapping.items(), key=lambda x: len(x[0]), reverse=True):
    text = text.replace(replacement, original)

Sorted by length descending so longer tokens don’t get partially overwritten by shorter ones.
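Chained replace calls have one more edge case: if a restored original happens to contain another fake value, a later pass rewrites it again. A single-pass regex substitution avoids that. The following is a sketch of that alternative, not the code llm-anonymizer ships.

```python
import re


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore originals in one pass over the text (mapping is fake -> real)."""
    if not mapping:
        return text
    # Longest alternatives first: regex alternation takes the first that matches.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Because every match is taken from the input text and never from already substituted output, a mapping such as {"Alpha": "Beta", "Beta": "Gamma"} restores "Alpha Beta" to "Beta Gamma" instead of cascading to "Gamma Gamma".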

The Kubernetes stack

Ollama runs as a separate deployment in the same namespace as everything else (web-ai-engine). Intra-namespace traffic is already allowed there, so no new network policies are needed.

llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF

One-time model pull after first deploy:

kubectl exec -n web-ai-engine deploy/ollama -- \
  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF

Ollama caches it on a 10Gi PVC, so pod restarts don’t re-download.

The n8n pipeline

Five-node chain triggered by webhook:

Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond

The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.

Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.

The caveats

1.7B isn’t GPT-4.1. The model scores 9.20/10 on the benchmark against GPT-4.1’s 9.77. That is a judged quality score, not an error rate, so it says little about how often an entity gets missed or mangled on your data. Test with real examples from your domain before depending on it.

Deanonymization breaks on heavy rephrasing. If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn’t eliminate the risk.
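One way to make those silent misses visible is to scan the restored text for fragments of fake values that survived, such as "Mr. Chen" when the mapping only knows "Nathan Chen". This is a heuristic sketch with a hypothetical function name; it is not part of llm-anonymizer and it will produce false positives on common words.

```python
import re


def leftover_fragments(restored: str, mapping: dict[str, str]) -> set[str]:
    """Words from fake values that still appear after deanonymization.

    mapping is fake -> real, the same shape /deanonymize uses.
    """
    leftovers: set[str] = set()
    for fake, original in mapping.items():
        original_words = set(re.findall(r"\w+", original))
        for word in re.findall(r"\w+", fake):
            if len(word) < 3 or word in original_words:
                continue  # ignore initials and words the real value shares
            if re.search(rf"\b{re.escape(word)}\b", restored):
                leftovers.add(word)
    return leftovers
```

A non-empty result is a reason to log the response for review; it is not proof that something leaked or was lost.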

Ollama adds a deployment. It’s ~500MB image + the model weights (~1GB Q4). On a constrained single-node cluster that’s real overhead. llama-server already covers general chat; Ollama is purely for this model’s tool-calling support.

Source

github.com/janos-gyorgy/llm-anonymizer — MIT licensed, Kubernetes manifests and n8n workflow included.

📊 I Added a Stats Service to My Game to Answer One Question. It Multiplied.

Fri, 18 Jul 2025 00:00:00 +0000

The problem

I built Dice & Shrines with five asymmetric guardian characters. Each one has a different passive and active ability that changes how reinforcements distribute, which territories you can attack, and what happens when you take damage.

The question I couldn’t answer from just playing was: are they actually balanced?

Not “do they feel different” — they obviously do. But is Fox’s stored critical actually overpowered? Is Turtle’s loss-recovery passive strong enough to matter, or is it just flavour? Is there a first-mover advantage baked into the map structure?

You can’t answer questions like these from vibes. You need data. So I built a stats service.

What gets recorded

Every game produces five event types, posted as HTTP calls from the game client to game-stats.hippotion.com/event. All are fire-and-forget except game_start, whose response carries the gameId:

map_generated — logged when the map generator accepts a map. Records territory count, average territory size, minimum size, and how many generation attempts it took. This tells me how often the generator discards its own work and whether the acceptance criteria are too strict.

game_start — fired when a game begins. Captures the number of players, the guardian assigned to each slot, and which slot is human. Returns a gameId that travels with the game for the rest of its life.

attack — fired on every single dice roll. Attacker, defender, from-territory, to-territory, how many dice each side had, what they rolled, who won. This is the raw material for the probability analysis.

elimination — fired when a player is knocked out. Records which guardian they were and how many players remained, so I can tell who exits first and who makes the final stand.

game_end — fired on win or abandon. Records the winner’s guardian, how many turns the game took, and whether it was abandoned.

The service is a FastAPI app backed by PostgreSQL, running in the homelab on the same k3s cluster as the game. About 150 lines of Python plus a schema.sql that the app runs on startup.
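To make the attack event concrete, here is one way its payload could be modelled. Every field name below is an assumption derived from the description above; the real schema is not shown in the post.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttackEvent:
    """Hypothetical shape of one attack event."""
    game_id: str
    turn: int
    attacker: int            # player slot of the attacker
    defender: int            # player slot of the defender
    from_territory: int
    to_territory: int
    attacker_dice: int
    defender_dice: int
    attacker_roll: list[int]
    defender_roll: list[int]
    attacker_won: bool

    def __post_init__(self) -> None:
        # Cheap consistency checks before the event is posted or stored.
        if len(self.attacker_roll) != self.attacker_dice:
            raise ValueError("attacker_roll length must equal attacker_dice")
        if len(self.defender_roll) != self.defender_dice:
            raise ValueError("defender_roll length must equal defender_dice")

    def to_json(self) -> str:
        return json.dumps({"type": "attack", **asdict(self)})
```

Storing the dice counts next to the raw rolls is redundant, but it keeps the margin queries a simple subtraction.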

The dashboard

The stats dashboard is a single-page HTML response from / — self-contained, no external framework, chart.js for the visualisations. It polls /api/stats every 30 seconds and updates in place.

What it shows:

Overview cards: total games, games today, games this week, human win rate, average turns per game, overall attack win rate, abandoned game count.

Activity charts: games per day (last 7 days), game duration distribution in 10-turn buckets.

Death spiral analysis: when players abandon (broken into phases: instant, early, mid-early, mid, late), and first-mover advantage — win percentage by player slot 0 through 5.

Attack behaviour: the dice margin chart is the most interesting one. It shows attack volume and win rate for every possible attacker-dice-minus-defender-dice value, from strongly negative (attacker is outmatched) to strongly positive. Overlaid: a win rate line. You can see the actual probability curve emerging from real games and compare it to what the math predicts.

Guardian intelligence: win rate, pick count, average attacks per game, survival rate to turn 50+, and average turns per winning game — per guardian, human players only.

Elimination intelligence: when the first player gets knocked out per game, and a guardian fate table showing average elimination order and first-out percentage. Earliest-exiting guardian is surfaced explicitly.

Map influence: territory count versus average game length. Also an attack efficiency heatmap — win rate for every attacker-dice × defender-dice combination, 1 through 8, rendered as a colour grid.

Recent games: last 15 games with the human player’s guardian, result, and IP address so I can tell if it’s me testing or an actual player who wandered in.
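The dice margin chart boils down to a group-by over the attack rows. The service most likely does this in SQL; the sketch below shows the same aggregation in Python, with an assumed row shape of (attacker_dice, defender_dice, attacker_won).

```python
from collections import defaultdict
from typing import Iterable


def win_rate_by_margin(
    attacks: Iterable[tuple[int, int, bool]],
) -> dict[int, tuple[int, float]]:
    """Map dice margin -> (attack count, attacker win rate)."""
    counts: dict[int, int] = defaultdict(int)
    wins: dict[int, int] = defaultdict(int)
    for attacker_dice, defender_dice, attacker_won in attacks:
        margin = attacker_dice - defender_dice
        counts[margin] += 1
        wins[margin] += int(attacker_won)
    # Sorted by margin so the result can feed a chart axis directly.
    return {m: (counts[m], wins[m] / counts[m]) for m in sorted(counts)}
```

The count is the bar height and the win rate is the overlaid line. Grouping on the (attacker_dice, defender_dice) pair instead of the difference gives the heatmap.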

What the data showed

The attack win rate across all games sits just under 60%. That’s higher than a naive analysis suggests: with fair rolls, equal dice should be near-even. The explanation is selection bias. Players mostly attack when they have a dice advantage; nobody sends 2 dice at 8 dice repeatedly. The average attack has a positive margin, so the average win rate is above 50%.

The margin chart made this explicit. A plurality of attacks have a margin of +2 or more. Attacks at a negative margin (technically losing plays) are a real but small fraction, usually late-game desperation or deliberate tempo plays.
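To compare the observed curve with what the math predicts, you need the exact win probability for each dice matchup. The post does not state the combat rule, so the sketch below assumes a Dice Wars style rule: both sides roll all their six-sided dice, sums are compared, and the attacker wins only on a strictly higher sum.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def sum_distribution(n_dice: int) -> tuple[Fraction, ...]:
    """dist[s] = probability that n six-sided dice sum to s."""
    dist = [Fraction(1)]  # zero dice: sum 0 with probability 1
    for _ in range(n_dice):
        nxt = [Fraction(0)] * (len(dist) + 6)
        for total, p in enumerate(dist):
            for face in range(1, 7):
                nxt[total + face] += p / 6
        dist = nxt
    return tuple(dist)


def attacker_win_probability(attacker_dice: int, defender_dice: int) -> Fraction:
    """P(attacker sum > defender sum); ties go to the defender (assumed rule)."""
    a = sum_distribution(attacker_dice)
    d = sum_distribution(defender_dice)
    return sum(
        (pa * pd for sa, pa in enumerate(a) for sd, pd in enumerate(d) if sa > sd),
        Fraction(0),
    )
```

Under this assumed rule, attacker_win_probability(1, 1) is 5/12 and attacker_win_probability(2, 1) is 181/216. Weighting these probabilities by the real attack volume at each matchup gives the overall win rate the math would predict for the attacks players actually chose.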

Human vs AI attack quality turned out to be the sharpest comparison. Humans and AI have different average margins. The AI is greedy but disciplined about attack selection; humans sometimes take gambles the AI wouldn’t. You can see it in the numbers.

First-mover advantage is measurable but not massive. Player slot 0 (goes first) has a slightly higher win rate than the average. Slots at the higher end of turn order are somewhat depressed. Not broken, but real — and a useful thing to watch if I ever add a competitive mode.

Guardian balance: the win rate gap between the best and worst guardian tells me whether the balance is within acceptable range or a concern. The dashboard calls it out explicitly: if the gap exceeds 15 percentage points, it flags it as a balance issue. That threshold is arbitrary, but it forces a decision rather than letting drift accumulate unnoticed.
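The flag itself is a one-line comparison. A minimal sketch, assuming win rates are already expressed in percent per guardian; the function name is hypothetical.

```python
def balance_gap(win_rates: dict[str, float], threshold_pp: float = 15.0) -> tuple[float, bool]:
    """Return (gap in percentage points, flagged?) for per-guardian win rates."""
    if len(win_rates) < 2:
        return 0.0, False
    gap = max(win_rates.values()) - min(win_rates.values())
    # "Exceeds" is read strictly: a gap of exactly 15 points is not flagged.
    return gap, gap > threshold_pp
```

With made-up rates of 30% for Fox and 12% for Turtle the gap is 18 points and gets flagged. Those numbers are purely illustrative and not taken from the dashboard.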

Abandonment phases: most abandonments are instant — the player clicked “new game” before actually playing. The interesting number is mid-game abandonment, which is a proxy for death spirals: you see your income drop, you know you’re losing, you close the tab. That’s a design signal, not just a metric.

Designing for measurement

The useful insight from building this is that it changes how you design the game. Once you know every attack is being logged, you start thinking about what the attack data will tell you. Shrines give territories a guaranteed die — does that show up in attack margins near shrine territories? I didn’t add territory-topology tracking, but I could. The schema is just a few columns away.

The same goes for guardian abilities. Fox’s stored critical fires at turn boundaries — I log turn number on every attack, so I can look for Fox spikes in attack win rate on certain turns. I haven’t run that query yet, but the data is there if the balance question becomes sharp enough to need it.

That’s the thing about adding observability to something you built yourself: you stop guessing about whether it’s working and start reading the evidence. The game got more interesting to design once I could see what was actually happening inside it.

The stack

FastAPI — event intake and stats API, ~150 lines
PostgreSQL — five tables: maps, games, game_guardians, attacks, eliminations
chart.js — dashboard visualisations, loaded from CDN
k3s + Argo CD — deployed as a Kubernetes pod, Dockerised, managed GitOps alongside everything else on the homelab

Source at dice-n-shrines-stats.