<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Privacy on hippotion</title><link>https://blog.hippotion.com/tags/privacy/</link><description>Recent content in Privacy on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 26 Sep 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/privacy/index.xml" rel="self" type="application/rss+xml"/><item><title>🔒 Building a PII Guardrail Proxy for Cloud LLM Calls</title><link>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</link><pubDate>Fri, 26 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/ai-pii-guardrail-proxy/</guid><description>A local model classifies every prompt before it leaves the cluster. If it&amp;rsquo;s sensitive, it&amp;rsquo;s blocked. If it&amp;rsquo;s clean, it goes to NVIDIA NIM. 150 lines of FastAPI, deployed on k3s.</description><content:encoded><![CDATA[<h2 id="the-problem-with-cloud-llm-access">The problem with cloud LLM access</h2>
<p>Running a local model is great for privacy. But local models hit a ceiling — for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.</p>
<p>The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone&rsquo;s personal data into the chat? It leaves the cluster. It&rsquo;s logged somewhere you don&rsquo;t control.</p>
<p>The standard answer is &ldquo;train your users.&rdquo; I&rsquo;d rather have a technical control.</p>
<h2 id="the-architecture">The architecture</h2>
<pre tabindex="0"><code>Open WebUI → ai-guard proxy
                 │
        ┌────────┴────────┐
        │                 │
  llama-server       if SAFE:
  (classify)         forward to NVIDIA NIM
        │
   if SENSITIVE:
   block + explain
</code></pre><p>Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:</p>
<ul>
<li><code>SAFE</code> → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)</li>
<li><code>SENSITIVE: &lt;reason&gt;</code> → return HTTP 400, log the block, nothing leaves the cluster</li>
</ul>
<p>The local model is already running for inference — this reuses it as a privacy gatekeeper at zero extra infrastructure cost.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The proxy is ~150 lines of FastAPI. The classifier call:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">CLASSIFIER_PROMPT</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;You are a data security classifier. Check if the text below contains sensitive information:
</span></span></span><span class="line"><span class="cl"><span class="s2">passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Reply with ONLY one of:
</span></span></span><span class="line"><span class="cl"><span class="s2">SAFE
</span></span></span><span class="line"><span class="cl"><span class="s2">SENSITIVE: &lt;one-line reason&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Text to check:
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">LLAMA_BASE</span><span class="si">}</span><span class="s2">/chat/completions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;phi-3.5-mini&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">CLASSIFIER_PROMPT</span> <span class="o">+</span> <span class="n">text</span><span class="p">[:</span><span class="mi">3000</span><span class="p">]}],</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;max_tokens&#34;</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Authorization&#34;</span><span class="p">:</span> <span class="s2">&#34;Bearer sk-no-key&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">answer</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">answer</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&#34;SENSITIVE&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">reason</span> <span class="o">=</span> <span class="n">answer</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;:&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="s2">&#34;:&#34;</span> <span class="ow">in</span> <span class="n">answer</span> <span class="k">else</span> <span class="s2">&#34;sensitive content detected&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span><span class="p">,</span> <span class="n">reason</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span><span class="p">,</span> <span class="s2">&#34;&#34;</span>
</span></span></code></pre></div><p><code>temperature=0</code> and <code>max_tokens=30</code> keep the response deterministic and fast. The model only needs to output one word or one line.</p>
<p>The main handler:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@app.post</span><span class="p">(</span><span class="s2">&#34;/v1/chat/completions&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">proxy_chat</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span> <span class="o">=</span> <span class="k">await</span> <span class="n">request</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">user_text</span> <span class="o">=</span> <span class="n">extract_user_text</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;messages&#34;</span><span class="p">,</span> <span class="p">[]))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">user_text</span><span class="o">.</span><span class="n">strip</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span><span class="p">,</span> <span class="n">reason</span> <span class="o">=</span> <span class="k">await</span> <span class="n">classify</span><span class="p">(</span><span class="n">user_text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">&#34;classifier error: </span><span class="si">%s</span><span class="s2"> — allowing request through&#34;</span><span class="p">,</span> <span class="n">exc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">is_sensitive</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">is_sensitive</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;error&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;message&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;Request blocked by ai-guard: </span><span class="si">{</span><span class="n">reason</span><span class="si">}</span><span class="s2">. Remove sensitive content before sending to external models.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;content_policy_violation&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Safe — forward to upstream with streaming support</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span></code></pre></div><p>Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I&rsquo;d rather not block all cloud LLM access because the local model is warming up.</p>
<h2 id="kubernetes-deployment">Kubernetes deployment</h2>
<p>ai-guard runs in the same namespace as llama-server and Open WebUI (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed in Cilium, so no new network policy needed.</p>
<p>Open WebUI uses semicolon-separated lists for multiple API backends:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_BASE_URLS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">OPENAI_API_KEYS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;sk-no-key;sk-no-key&#34;</span><span class="w">
</span></span></span></code></pre></div><p>The second entry is ai-guard. Open WebUI passes <code>sk-no-key</code> as the API key — ai-guard ignores it and uses its own <code>UPSTREAM_API_KEY</code> from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.</p>
<h2 id="the-latency-tradeoff">The latency tradeoff</h2>
<p>The classification step adds 5–15 seconds on CPU inference. That&rsquo;s the cost of keeping the check fully private — the classifier never sends data anywhere.</p>
<p>For a personal homelab assistant, this is fine. For a high-throughput production setup, you&rsquo;d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.</p>
<h2 id="what-it-catches">What it catches</h2>
<p>The classifier prompt targets:</p>
<ul>
<li>Passwords, API keys, tokens, credentials</li>
<li>PII: names, emails, phone numbers, SSNs, addresses</li>
<li>Financial data: card numbers, bank accounts</li>
<li>Private keys</li>
</ul>
<p>False negatives are possible — no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/ai-guard">github.com/janos-gyorgy/ai-guard</a> — MIT licensed, Kubernetes manifests included.</p>
]]></content:encoded></item><item><title>🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send</title><link>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-anonymizer-privacy-pipeline/</guid><description>Replace PII with semantically realistic fakes before sending to a cloud LLM, then restore the originals from the response. Started with a general model and prompt engineering — then upgraded to a purpose-built 1.7B fine-tune via Ollama.</description><content:encoded><![CDATA[<h2 id="the-problem-with-blocking">The problem with blocking</h2>
<p>The <a href="/posts/ai-pii-guardrail-proxy/">PII guardrail proxy I built last week</a> works by classifying prompts and blocking the sensitive ones. That&rsquo;s fine for a chat interface where a human can rephrase. It doesn&rsquo;t work for automated pipelines.</p>
<p>If a Jira ticket contains someone&rsquo;s name and an internal hostname, you don&rsquo;t want the agent to fail — you want it to process the ticket without exposing that data. Blocking is the wrong primitive for pipelines. Anonymization is the right one.</p>
<h2 id="the-pattern">The pattern</h2>
<pre tabindex="0"><code>Input text
  → anonymizer: extract PII, replace with semantic fakes
  → &#34;Nathan Chen from DataSoft LLC needs ProjectX fixed on dev.internal.net&#34;
  + mapping: {&#34;Nathan Chen&#34; → &#34;John Smith&#34;, &#34;DataSoft LLC&#34; → &#34;ACME&#34;, ...}
  → cloud LLM: processes coherent text, never sees real values
  → &#34;Nathan Chen should check the ProjectX docs with the DataSoft LLC team&#34;
  → string substitution with reverse mapping
  → &#34;John Smith should check the OAuth docs with the ACME team&#34;
</code></pre><p>Two things that make this work:</p>
<p><strong>Deanonymization needs no LLM.</strong> Once you have the mapping, restoring is pure string substitution. The model call only happens on the way in.</p>
<p><strong>Semantic fakes beat placeholder tokens.</strong> An earlier version of this used <code>[PERSON_1]</code>, <code>[ORG_1]</code> tokens. The problem: cloud models see bracketed text and subtly change behaviour — shorter responses, hedging, dropped context. When the cloud model sees <code>Nathan Chen from DataSoft LLC</code>, it treats it as real text and responds naturally. Quality is noticeably better.</p>
<h2 id="prior-art--what-already-exists">Prior art — what already exists</h2>
<p>This is a well-established pattern. Worth knowing what&rsquo;s out there:</p>
<p><strong><a href="https://llm-guard.com/output_scanners/deanonymize/">LLM Guard</a></strong> (Protect AI) — the most complete open-source implementation. Anonymize + Deanonymize scanner pair with a Vault for the mapping. Production-grade, actively maintained. Start here if you&rsquo;re building this for anything serious.</p>
<p><strong><a href="https://techcommunity.microsoft.com/blog/azuredevcommunityblog/introducing-pii-shield-a-privacy-proxy-for-every-llm-call/4514726">Microsoft PII Shield</a></strong> — session-based proxy. Returns a session ID with the anonymized text, uses it to deanonymize the response.</p>
<p><strong><a href="https://github.com/fsndzomga/anonLLM">anonLLM</a></strong> — uses GLiNER (a proper NER model) + Faker for realistic replacements. Better accuracy than a general chat model.</p>
<p><strong><a href="https://ieeexplore.ieee.org/document/11140717/">REDACT</a></strong> — IEEE paper describing a system using Ollama for PII redaction in documents.</p>
<p><strong><a href="https://huggingface.co/blog/pratyushrt/anonymizerslm">HuggingFace Anonymizer SLM series</a></strong> — purpose-built models (0.6B/1.7B/4B) fine-tuned specifically for anonymization. 9.20/10 quality score for 1.7B, close to GPT-4.1&rsquo;s 9.77.</p>
<p>That last one is what this implementation actually uses.</p>
<h2 id="the-model-anonymizer-17b">The model: Anonymizer-1.7B</h2>
<p><a href="https://huggingface.co/eternisai/Anonymizer-1.7B">eternisai/Anonymizer-1.7B</a> is a Qwen3-1.7B fine-tune trained on ~30k anonymization samples using GRPO with GPT-4.1 as judge. It outputs structured tool calls instead of free text:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;arguments&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;John Smith&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;Nathan Chen&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;ACME Corp&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;DataSoft LLC&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span><span class="nt">&#34;original&#34;</span><span class="p">:</span> <span class="s2">&#34;auth.acme.internal&#34;</span><span class="p">,</span> <span class="nt">&#34;replacement&#34;</span><span class="p">:</span> <span class="s2">&#34;dev.internal.net&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>No prompt engineering needed. The model knows exactly what it&rsquo;s doing and outputs a structured contract. Compare that to the first version of this service, which sent a long JSON-format prompt to Phi-3.5-mini and hoped the output parsed correctly.</p>
<p>The model runs via Ollama (which handles the Qwen3 chat template and tool calling natively), pointed at the GGUF version from HuggingFace: <code>hf.co/gabriellarson/Anonymizer-1.7B-GGUF</code>.</p>
<h2 id="the-implementation">The implementation</h2>
<p><code>llm-anonymizer</code> is a FastAPI service with two endpoints.</p>
<p><strong><code>POST /anonymize</code></strong> — calls Ollama with the tool definition, parses the response:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">TOOLS</span> <span class="o">=</span> <span class="p">[{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;replace_entities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Replace PII entities with anonymized versions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;replacements&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;array&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;items&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;original&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                            <span class="s2">&#34;replacement&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="p">},</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">,</span> <span class="s2">&#34;replacement&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                    <span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;replacements&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl"><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">OLLAMA_BASE</span><span class="si">}</span><span class="s2">/api/chat&#34;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">text</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">/no_think&#34;</span><span class="p">},</span>  <span class="c1"># skip Qwen3 thinking mode</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;tools&#34;</span><span class="p">:</span> <span class="n">TOOLS</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tool_calls</span> <span class="o">=</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;tool_calls&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">replacements</span> <span class="o">=</span> <span class="n">tool_calls</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;function&#34;</span><span class="p">][</span><span class="s2">&#34;arguments&#34;</span><span class="p">][</span><span class="s2">&#34;replacements&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build reverse mapping: replacement → original (for deanonymization)</span>
</span></span><span class="line"><span class="cl"><span class="n">anonymized</span> <span class="o">=</span> <span class="n">text</span>
</span></span><span class="line"><span class="cl"><span class="n">mapping</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">replacements</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">anonymized</span> <span class="o">=</span> <span class="n">anonymized</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">],</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">mapping</span><span class="p">[</span><span class="n">pair</span><span class="p">[</span><span class="s2">&#34;replacement&#34;</span><span class="p">]]</span> <span class="o">=</span> <span class="n">pair</span><span class="p">[</span><span class="s2">&#34;original&#34;</span><span class="p">]</span>
</span></span></code></pre></div><p>The <code>/no_think</code> suffix tells the model to skip its chain-of-thought — faster response, same accuracy for this task.</p>
<p><strong><code>POST /deanonymize</code></strong> — no model call, just substitution:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">replacement</span><span class="p">,</span> <span class="n">original</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">mapping</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">replacement</span><span class="p">,</span> <span class="n">original</span><span class="p">)</span>
</span></span></code></pre></div><p>Sorted by length descending so longer tokens don&rsquo;t get partially overwritten by shorter ones.</p>
<h2 id="the-kubernetes-stack">The Kubernetes stack</h2>
<p>Ollama runs as a separate deployment in the same namespace as everything else (<code>web-ai-engine</code>). Intra-namespace traffic is always allowed — no new network policies.</p>
<pre tabindex="0"><code>llm-anonymizer (FastAPI) → Ollama (port 11434) → Anonymizer-1.7B GGUF
</code></pre><p>One-time model pull after first deploy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl <span class="nb">exec</span> -n web-ai-engine deploy/ollama -- <span class="se">\
</span></span></span><span class="line"><span class="cl">  ollama pull hf.co/gabriellarson/Anonymizer-1.7B-GGUF
</span></span></code></pre></div><p>Ollama caches it on a 10Gi PVC, so pod restarts don&rsquo;t re-download.</p>
<h2 id="the-n8n-pipeline">The n8n pipeline</h2>
<p>Five-node chain triggered by webhook:</p>
<pre tabindex="0"><code>Webhook → /anonymize → NVIDIA NIM → /deanonymize → Respond
</code></pre><p>The NVIDIA NIM call includes a system prompt instructing it to treat the text as normal input. No mention of tokens, no special handling — because the text looks like real text.</p>
<p>Wire any upstream source to the webhook: Jira event, Slack slash command, a scheduled job that processes internal docs. The pipeline is source-agnostic.</p>
<h2 id="the-caveats">The caveats</h2>
<p><strong>1.7B isn&rsquo;t GPT-4.1.</strong> The model scores 9.20/10 on the benchmark — which means roughly 1 in 10 cases has a missed or incorrect entity. Test with real examples from your domain before depending on it.</p>
<p><strong>Deanonymization breaks on heavy rephrasing.</strong> If the cloud model restructures a sentence enough that the fake value no longer appears verbatim, the substitution silently misses it. The prompt helps but doesn&rsquo;t eliminate the risk.</p>
<p><strong>Ollama adds a deployment.</strong> It&rsquo;s ~500MB image + the model weights (~1GB Q4). On a constrained single-node cluster that&rsquo;s real overhead. llama-server already covers general chat; Ollama is purely for this model&rsquo;s tool-calling support.</p>
<h2 id="source">Source</h2>
<p><a href="https://github.com/janos-gyorgy/llm-anonymizer">github.com/janos-gyorgy/llm-anonymizer</a> — MIT licensed, Kubernetes manifests and n8n workflow included.</p>
]]></content:encoded></item></channel></rss>