Privacy on hippotion

🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

Fri, 26 Sep 2025 00:00:00 +0000

The problem with cloud LLM access

Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.

The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone’s personal data into the chat? It leaves the cluster. It’s logged somewhere you don’t control.

The standard answer is “train your users.” I’d rather have a technical control.

The architecture

Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain

Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:

SAFE → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)
SENSITIVE: <reason> → return HTTP 400, log the block, and nothing leaves the cluster

The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.

The implementation

The proxy is ~150 lines of FastAPI. The classifier call:

CLASSIFIER_PROMPT = """You are a data security classifier. Check if the text below contains sensitive information:
passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.

Reply with ONLY one of:
SAFE
SENSITIVE: <reason>

Text to check:
"""

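async def classify(text: str) -> tuple[bool, str]:
    """Ask the local model for a verdict; returns (is_sensitive, reason)."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            f"{LLAMA_BASE}/chat/completions",
            json={
                "model": "phi-3.5-mini",
                # Only the first 3000 characters are classified; anything past that is not checked.
                "messages": [{"role": "user", "content": CLASSIFIER_PROMPT + text[:3000]}],
                "max_tokens": 30,
                "temperature": 0,
                "stream": False,
            },
            headers={"Authorization": "Bearer sk-no-key"},
        )
    # Turn HTTP errors into exceptions so the caller's error handling decides what happens.
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    if answer.upper().startswith("SENSITIVE"):
        # Everything after the first colon is the model's stated reason.
        reason = answer.split(":", 1)[1].strip() if ":" in answer else "sensitive content detected"
        return True, reason
    # Any other reply, including unexpected output, is treated as SAFE.
    return False, ""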

temperature=0 and max_tokens=30 keep the response deterministic and fast. The model only needs to output one word or one line.
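
The reply parsing can also be pulled out into a pure function, which makes it testable without a running llama-server. A minimal sketch: `parse_verdict` is a hypothetical name, not something from the repo, and unlike the inline version it falls back to the default reason when the model writes `SENSITIVE:` with nothing after the colon.

```python
def parse_verdict(answer: str) -> tuple[bool, str]:
    """Map the classifier's raw reply to (is_sensitive, reason)."""
    answer = answer.strip()
    if answer.upper().startswith("SENSITIVE"):
        # Everything after the first colon is the model's stated reason.
        _, _, reason = answer.partition(":")
        return True, reason.strip() or "sensitive content detected"
    # Anything else, including unexpected chatter, counts as SAFE,
    # which mirrors the behaviour of classify().
    return False, ""


print(parse_verdict("SENSITIVE: contains an API key"))
print(parse_verdict("SAFE"))
```

Note the second branch: a model that rambles instead of answering is treated as SAFE, so the prompt format matters more than it looks.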

The main handler:

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    body = await request.json()
    user_text = extract_user_text(body.get("messages", []))

    if user_text.strip():
        try:
            is_sensitive, reason = await classify(user_text)
        except Exception as exc:
            log.error("classifier error: %s — allowing request through", exc)
            is_sensitive = False

        if is_sensitive:
            return JSONResponse(status_code=400, content={
                "error": {
                    "message": f"Request blocked by ai-guard: {reason}. Remove sensitive content before sending to external models.",
                    "type": "content_policy_violation",
                }
            })

    # Safe — forward to upstream with streaming support
    ...
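
The handler relies on `extract_user_text`, which isn't shown here. One plausible shape for it, as a sketch only: it assumes OpenAI-style messages and checks every user message rather than just the last one, since chat clients resend the whole history. The real helper in the repo may behave differently.

```python
def extract_user_text(messages: list[dict]) -> str:
    """Collect the text of all user messages in an OpenAI-style chat request."""
    chunks: list[str] = []
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        if isinstance(content, str):
            chunks.append(content)
        elif isinstance(content, list):
            # Multi-part content: keep the text parts, skip images and the like.
            for part in content:
                if isinstance(part, dict) and part.get("type") == "text":
                    chunks.append(str(part.get("text", "")))
    return "\n".join(chunks)


print(extract_user_text([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Summarize this log for me"},
]))
```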

Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I’d rather not block all cloud LLM access because the local model is warming up.
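
If you'd rather make that choice configurable, the error handling can be wrapped once and switched with a flag. A sketch under assumptions: `guarded_classify` and the `fail_closed` parameter are hypothetical names, and the classifier is passed in as a callable so the wrapper can be exercised with stubs.

```python
import asyncio
from typing import Awaitable, Callable

Classifier = Callable[[str], Awaitable[tuple[bool, str]]]


async def guarded_classify(
    text: str, classify: Classifier, fail_closed: bool = False
) -> tuple[bool, str]:
    """Run the classifier and decide what a classifier failure means."""
    try:
        return await classify(text)
    except Exception:
        if fail_closed:
            # Treat an unavailable classifier like a positive hit: block.
            return True, "classifier unavailable"
        # Fail-open: let the request through, as the handler above does.
        return False, ""


async def _broken(text: str) -> tuple[bool, str]:
    raise RuntimeError("llama-server is down")  # stand-in for a real outage


print(asyncio.run(guarded_classify("hello", _broken)))
print(asyncio.run(guarded_classify("hello", _broken, fail_closed=True)))
```

In fail-closed mode a 503 would arguably describe the situation better than the 400 used for real blocks, but that's a detail for the handler.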

Kubernetes deployment

ai-guard runs in the same namespace as llama-server and Open WebUI (web-ai-engine). Intra-namespace traffic is already allowed by this cluster's Cilium policies, so no new network policy is needed.

Open WebUI uses semicolon-separated lists for multiple API backends:

- name: OPENAI_API_BASE_URLS
  value: "http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1"
- name: OPENAI_API_KEYS
  value: "sk-no-key;sk-no-key"

The second entry is ai-guard. Open WebUI passes sk-no-key as the API key — ai-guard ignores it and uses its own UPSTREAM_API_KEY from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.
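
The key swap itself is a few lines. A minimal sketch, assuming the proxy copies most client headers to the upstream request; `build_upstream_headers` is a hypothetical helper name, and the set of dropped headers is my choice rather than something taken from the repo.

```python
def build_upstream_headers(incoming: dict[str, str], upstream_key: str) -> dict[str, str]:
    """Drop the client's placeholder credentials and attach the real key."""
    # Headers that must not be copied verbatim to the upstream API.
    drop = {"authorization", "host", "content-length"}
    headers = {k: v for k, v in incoming.items() if k.lower() not in drop}
    headers["Authorization"] = f"Bearer {upstream_key}"
    return headers


print(build_upstream_headers(
    {"Authorization": "Bearer sk-no-key", "Content-Type": "application/json"},
    "example-upstream-key",  # placeholder, not a real key
))
```

In the running service the second argument would come from the environment, e.g. `os.environ["UPSTREAM_API_KEY"]`, which is where the Kubernetes Secret ends up.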

The latency tradeoff

The classification step adds 5–15 seconds on CPU inference. That’s the cost of keeping the check fully private — the classifier never sends data anywhere.

For a personal homelab assistant, this is fine. For a high-throughput production setup, you’d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.

What it catches

The classifier prompt targets:

Passwords, API keys, tokens, credentials
PII: names, emails, phone numbers, SSNs, addresses
Financial data: card numbers, bank accounts
Private keys

False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.
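
One way to get a feel for the false-negative rate is to run a handful of labelled samples through the classifier and list what slips past. A sketch only: the samples are made up, and the keyword check is a stand-in for the real classifier so the snippet runs on its own.

```python
from typing import Callable


def false_negatives(
    samples: list[tuple[str, bool]], is_sensitive: Callable[[str], bool]
) -> list[str]:
    """Return the texts labelled sensitive that the classifier let through."""
    return [text for text, expected in samples if expected and not is_sensitive(text)]


# Invented examples; replace with prompts from your own environment.
samples = [
    ("my password is hunter2", True),
    ("card 4111 1111 1111 1111 expires 01/30", True),
    ("what is a mutex?", False),
]

# Deliberately weak stand-in classifier, to show a miss being reported.
missed = false_negatives(samples, lambda t: "password" in t.lower())
print(missed)
```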

Source

github.com/janos-gyorgy/ai-guard — MIT licensed, Kubernetes manifests included.

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Fri, 12 Sep 2025 00:00:00 +0000

The problem with blocking

The PII guardrail proxy I built last week works by classifying prompts and blocking the sensitive ones. That’s fine for a chat interface where a human can rephrase. It doesn’t work for automated pipelines.

If a Jira ticket contains someone’s name and an internal hostname, you don’t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.

The pattern

Input text
  → anonymizer: extract PII, replace with semantic fakes
  → "Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net"
  + mapping: {"Nathan Chen" → "John Smith", "DataSoft LLC" → "ACME", ...}
  → cloud LLM: processes coherent text, never sees real values
  → "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
  → string substitution with reverse mapping
  → "John Smith should check the OAuth docs with the ACME team"

Two things that make this work:

Deanonymization needs no LLM. Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.

Semantic fakes beat placeholder tokens. An earlier version of this used [PERSON_1], [ORG_1] tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees Nathan Chen from DataSoft LLC, it treats it as real text and responds naturally. Quality is noticeably better.
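
The round trip can be sketched with nothing but string replacement. The mapping below reuses the names from the example above; the "OAuth"/"ProjectX" pair is inferred from the last line of that example, and the model reply is hard-coded as a stand-in for the cloud call.

```python
def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Replace every key of `mapping` found in `text` with its value, longest key first."""
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text


# Illustrative forward mapping (real -> fake), as an anonymizer might produce it.
forward = {"John Smith": "Nathan Chen", "ACME": "DataSoft LLC", "OAuth": "ProjectX"}
reverse = {fake: real for real, fake in forward.items()}

original = "John Smith from ACME needs OAuth fixed"
anonymized = apply_mapping(original, forward)

# Stand-in for the cloud model's answer, which only ever sees the fakes.
llm_reply = "Nathan Chen should check the ProjectX docs with the DataSoft LLC team"
restored = apply_mapping(llm_reply, reverse)

print(anonymized)
print(restored)
```

Only the forward direction needs a model to decide what counts as an entity; the reverse direction is the dictionary turned around.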

Prior art — what already exists

This is a well-established pattern. Worth knowing what’s out there:

LLM Guard (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you’re building this for anything serious.

Microsoft PII Shield — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.

anonLLM — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.

REDACT — IEEE paper describing a system using Ollama for PII redaction in documents.

HuggingFace Anonymizer SLM series — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1’s 9.77.

That last one is what this implementation actually uses.

The model: Anonymizer-1.7B

eternisai/Anonymizer-1.7B is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:

{
  "name": "replace_entities",
  "arguments": {
    "replacements": [
      {"original": "John Smith", "replacement": "Nathan Chen"},
      {"original": "ACME Corp", "replacement": "DataSoft LLC"},
      {"original": "auth.acme.internal", "replacement": "dev.internal.net"}
    ]
  }
}

No prompt engineering needed. The model knows exactly what it’s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.
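
A structured contract is still worth validating before you trust it. A minimal sketch of such a check: `parse_replacements` is a hypothetical helper, and it accepts the arguments either as an object (as in the JSON above) or as a JSON string, which is how the OpenAI chat completions format encodes them.

```python
import json


def parse_replacements(tool_call: dict) -> list[tuple[str, str]]:
    """Validate a replace_entities tool call and return (original, replacement) pairs."""
    if tool_call.get("name") != "replace_entities":
        raise ValueError(f"unexpected tool: {tool_call.get('name')!r}")
    args = tool_call.get("arguments", {})
    if isinstance(args, str):
        args = json.loads(args)
    pairs: list[tuple[str, str]] = []
    for item in args.get("replacements", []):
        if not isinstance(item, dict):
            continue
        original, replacement = item.get("original"), item.get("replacement")
        # Skip malformed entries: both fields must be strings, original non-empty.
        if isinstance(original, str) and isinstance(replacement, str) and original:
            pairs.append((original, replacement))
    return pairs


call = {
    "name": "replace_entities",
    "arguments": {"replacements": [
        {"original": "John Smith", "replacement": "Nathan Chen"},
        {"original": "ACME Corp"},  # malformed: no replacement, gets skipped
    ]},
}
print(parse_replacements(call))
```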

The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: hf.co/gabriellarson/Anonymizer-1.7B-GGUF.

The implementation

llm-anonymizer is a FastAPI service with two endpoints.

POST /anonymize — calls Ollama with the tool definition, parses the response:

TOOLS = [{
    "type": "function",
    "function": {
        "name": "replace_entities",
        "description": "Replace PII entities with anonymized versions",
        "parameters": {
            "type": "object",
            "properties": {
                "replacements": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "original": {"type": "string"},
                            "replacement": {"type": "string"},
                        },
                        "required": ["original", "replacement"],
                    },
                }
            },
            "required": ["replacements"],
        },
    },
}]

resp = await client.post(f"{OLLAMA_BASE}/api/chat", json={
    "model": MODEL,
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text + "\n/no_think"},  # skip Qwen3 thinking mode
    ],
    "tools": TOOLS,
    "stream": False,
})

tool_calls = resp.json()["message"]["tool_calls"]
replacements = tool_calls[0]["function"]["arguments"]["replacements"]

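# Build reverse mapping: replacement → original (for deanonymization).
# Longer originals go first so a short entity can't clobber part of a longer one.
anonymized = text
mapping = {}
for pair in sorted(replacements, key=lambda p: len(p["original"]), reverse=True):
    anonymized = anonymized.replace(pair["original"], pair["replacement"])
    mapping[pair["replacement"]] = pair["original"]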

The /no_think suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.

POST /deanonymize — no model call, just substitution:

for replacement, original in sorted(mapping.items(), key=lambda x: len(x[0]), reverse=True):
    text = text.replace(replacement, original)

Sorted by length descending so longer tokens don’t get partially overwritten by shorter ones.
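
A small example shows what goes wrong without the sort. This is a sketch with a made-up mapping in which one fake is a prefix of another; the `longest_first` flag exists only to demonstrate the difference.

```python
def deanonymize(text: str, mapping: dict[str, str], longest_first: bool = True) -> str:
    """Restore originals; `mapping` is replacement -> original."""
    items = list(mapping.items())
    if longest_first:
        items.sort(key=lambda kv: len(kv[0]), reverse=True)
    for replacement, original in items:
        text = text.replace(replacement, original)
    return text


# Hypothetical mapping where the short fake is a prefix of the long one.
mapping = {"Nathan": "John", "Nathan Chen": "John Smith"}
reply = "Nathan Chen will call back"

good = deanonymize(reply, mapping)
bad = deanonymize(reply, mapping, longest_first=False)
print(good)
print(bad)
```

In insertion order the short key fires first and turns "Nathan Chen" into "John Chen", after which the longer key has nothing left to match.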

The Kubernetes stack

Ollama runs as a separate deployment in the same namespace as everything else (web-ai-engine). Intra-namespace traffic is already allowed in this cluster, so no new network policies are needed.

llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF

One-time model pull after first deploy:

kubectl exec -n web-ai-engine deploy/ollama -- \
  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF

Ollama caches it on a 10Gi PVC, so pod restarts don’t re-download.

The n8n pipeline

Five-node chain triggered by webhook:

Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond

The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.

Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.

The caveats

1.7B isn’t GPT-4.1. The model scores 9.20/10 on the benchmark, against 9.77 for GPT-4.1. That is a quality rating, not an error rate, so it doesn’t tell you how often an entity is missed or replaced incorrectly. Test with real examples from your domain before depending on it.

Deanonymization breaks on heavy rephrasing. If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn’t eliminate the risk.
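
The miss doesn't have to stay silent. One heuristic, sketched here under assumptions: flag any fake that never appears verbatim in the reply while one of its words still does, which suggests the model shortened or reworded it. `restore_report` is a hypothetical helper and the three-character word threshold is arbitrary.

```python
def restore_report(reply: str, mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Deanonymize `reply` and list fakes that only survive as fragments.

    `mapping` is fake -> real.
    """
    suspects: list[str] = []
    for fake in mapping:
        if fake in reply:
            continue  # will be restored normally
        words = [w for w in fake.split() if len(w) >= 3]
        if any(w in reply for w in words):
            suspects.append(fake)
    restored = reply
    for fake in sorted(mapping, key=len, reverse=True):
        restored = restored.replace(fake, mapping[fake])
    return restored, suspects


restored, suspects = restore_report(
    "Chen should sync with the DataSoft LLC team",
    {"Nathan Chen": "John Smith", "DataSoft LLC": "ACME"},
)
print(restored)
print(suspects)
```

A non-empty suspect list is a cue to log the case or route it for review; it can't repair the text by itself.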

Ollama adds a deployment. That’s a ~500MB image plus the model weights (~1GB at Q4). On a constrained single-node cluster that’s real overhead. llama-server already covers general chat; Ollama is there purely for this model’s tool-calling support.

Source

github.com/janos-gyorgy/llm-anonymizer — MIT licensed, Kubernetes manifests and n8n workflow included.