The problem with cloud LLM access
Running a local model is great for privacy. But local models hit a ceiling: for the heavy lifting, you want a cloud API like NVIDIA NIM with Llama 3.3 70B.
The moment you open that channel, you have a new risk: what if someone (or some automation) accidentally pastes a password, a private key, or someone’s personal data into the chat? It leaves the cluster. It’s logged somewhere you don’t control.
The standard answer is “train your users.” I’d rather have a technical control.
The architecture
Open WebUI → ai-guard proxy
                   │
          ┌────────┴────────┐
          │                 │
    llama-server         if SAFE:
     (classify)          forward to NVIDIA NIM
          │
    if SENSITIVE:
   block + explain
Every request to NVIDIA NIM goes through ai-guard first. ai-guard pulls the user message, sends it to the local llama.cpp server with a classification prompt, and makes a binary decision:
- SAFE → forward to NVIDIA NIM with the real API key (which ai-guard holds, not the client)
- SENSITIVE: <reason> → return HTTP 400, log the block, nothing leaves the cluster
The local model is already running for inference, so this reuses it as a privacy gatekeeper at zero extra infrastructure cost.
The implementation
The proxy is ~150 lines of FastAPI. The classifier call:
import httpx

# LLAMA_BASE (the llama-server base URL) is defined elsewhere in the module.

CLASSIFIER_PROMPT = """You are a data security classifier. Check if the text below contains sensitive information:
passwords, API keys, tokens, credentials, personal identifiable information (names, emails, phone numbers, SSNs, addresses), financial data (card numbers, bank accounts), or private keys.
Reply with ONLY one of:
SAFE
SENSITIVE: <one-line reason>
Text to check:
"""


async def classify(text: str) -> tuple[bool, str]:
    """Ask the local model for a verdict; return (is_sensitive, reason)."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            f"{LLAMA_BASE}/chat/completions",
            json={
                "model": "phi-3.5-mini",
                # Only the first 3000 characters are classified.
                "messages": [{"role": "user", "content": CLASSIFIER_PROMPT + text[:3000]}],
                "max_tokens": 30,
                "temperature": 0,
                "stream": False,
            },
            headers={"Authorization": "Bearer sk-no-key"},
        )
    # A non-2xx reply raises here and is handled by the caller's error path.
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    if answer.upper().startswith("SENSITIVE"):
        reason = answer.split(":", 1)[1].strip() if ":" in answer else "sensitive content detected"
        return True, reason
    # Anything that does not start with SENSITIVE is treated as safe.
    return False, ""
temperature=0 and max_tokens=30 keep the response deterministic and fast. The model only needs to output one word or one line.
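The verdict parsing at the end of classify() is the part most worth unit-testing, because it decides what happens when the model replies in an unexpected shape. Below is a minimal sketch that pulls that logic into a standalone function. The name parse_verdict is hypothetical and not part of the original proxy; the behaviour mirrors the code above.

```python
def parse_verdict(answer: str) -> tuple[bool, str]:
    """Map the classifier's raw reply to (is_sensitive, reason).

    Hypothetical helper: same logic as the tail of classify() above.
    """
    answer = answer.strip()
    if answer.upper().startswith("SENSITIVE"):
        if ":" in answer:
            # Everything after the first colon is the one-line reason.
            reason = answer.split(":", 1)[1].strip()
        else:
            reason = "sensitive content detected"
        return True, reason
    # Any other reply, including malformed output, counts as SAFE.
    return False, ""


if __name__ == "__main__":
    for reply in ["SAFE", "SENSITIVE: contains an API key", "Sensitive", "I think this is fine"]:
        print(repr(reply), "->", parse_verdict(reply))
```

Note the last case: a reply that ignores the format is treated as safe, which is consistent with the fail-open stance of the proxy.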
The main handler:
@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    body = await request.json()
    user_text = extract_user_text(body.get("messages", []))
    if user_text.strip():
        try:
            is_sensitive, reason = await classify(user_text)
        except Exception as exc:
            # Fail open: a classifier failure must not block the request.
            log.error("classifier error: %s - allowing request through", exc)
            is_sensitive = False
        if is_sensitive:
            return JSONResponse(status_code=400, content={
                "error": {
                    "message": f"Request blocked by ai-guard: {reason}. Remove sensitive content before sending to external models.",
                    "type": "content_policy_violation",
                }
            })
    # Safe: forward to upstream with streaming support
    ...
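The handler relies on extract_user_text, which is not shown here. The following is only a sketch of what such a helper could look like, not the version in the repository. It assumes OpenAI-style messages, where content is either a plain string or a list of typed parts, and it assumes that only the most recent user message is classified.

```python
from typing import Any


def extract_user_text(messages: list[dict[str, Any]]) -> str:
    """Return the text of the most recent user message, or "" if there is none.

    Illustrative sketch; the real helper may inspect more of the history.
    """
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Multi-part content: keep the text parts, skip images and the like.
            parts = [
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            ]
            return "\n".join(parts)
        # Unknown content shape: nothing to classify.
        return ""
    return ""
```

Checking only the latest message keeps latency down, but a secret pasted in an earlier turn is resent with the history, so a stricter variant would join every user message before classifying.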
Fail-open: if the classifier itself errors (llama-server down, timeout), the request goes through and the error is logged. Fail-closed would be safer for high-stakes environments, but this is a homelab and I’d rather not block all cloud LLM access because the local model is warming up.
Kubernetes deployment
ai-guard runs in the same namespace as llama-server and Open WebUI (web-ai-engine). Intra-namespace traffic is already allowed by the Cilium policies in this cluster, so no new network policy is needed.
Open WebUI uses semicolon-separated lists for multiple API backends:
- name: OPENAI_API_BASE_URLS
  value: "http://llama-server.web-ai-engine.svc:8080/v1;http://ai-guard.web-ai-engine.svc:8080/v1"
- name: OPENAI_API_KEYS
  value: "sk-no-key;sk-no-key"
The second entry is ai-guard. Open WebUI passes sk-no-key as the API key; ai-guard ignores it and uses its own UPSTREAM_API_KEY from a Kubernetes Secret (pulled from Vault via External Secrets Operator). The real NVIDIA API key never touches the client.
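The key swap can be sketched as a small header-building step on the proxy side. This is an illustration under assumptions, not the repository's code: the function name upstream_headers and the set of dropped headers are hypothetical, and UPSTREAM_API_KEY is assumed to reach the pod as an environment variable.

```python
import os


def upstream_headers(client_headers: dict[str, str]) -> dict[str, str]:
    """Build headers for the upstream call, replacing the client's placeholder key."""
    # Drop the client's Authorization (the sk-no-key placeholder) and headers
    # that the HTTP client recomputes for the new request.
    dropped = {"authorization", "host", "content-length"}
    headers = {k: v for k, v in client_headers.items() if k.lower() not in dropped}
    # The real key exists only in the proxy's environment.
    headers["Authorization"] = f"Bearer {os.environ['UPSTREAM_API_KEY']}"
    return headers


if __name__ == "__main__":
    os.environ.setdefault("UPSTREAM_API_KEY", "example-key")
    print(upstream_headers({"Authorization": "Bearer sk-no-key", "Content-Type": "application/json"}))
```

Because the lookup uses os.environ[...] rather than .get(), a missing secret fails loudly instead of sending an empty bearer token upstream.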
The latency tradeoff
The classification step adds 5–15 seconds on CPU inference. That’s the cost of keeping the check fully private: the classifier never sends data anywhere.
For a personal homelab assistant, this is fine. For a high-throughput production setup, you’d want the classifier on a GPU or a dedicated smaller model purpose-built for classification.
What it catches
The classifier prompt targets:
- Passwords, API keys, tokens, credentials
- PII: names, emails, phone numbers, SSNs, addresses
- Financial data: card numbers, bank accounts
- Private keys
False negatives are possible; no classifier is perfect. This is a first line of defense, not a compliance control. The value is catching the obvious, accidental leaks.
Source
github.com/janos-gyorgy/ai-guard (MIT licensed, Kubernetes manifests included).
