AI | hippotion

🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.

I built an AI agent in self-hosted n8n over my kombucha-tracking app, then gave it two brains — NVIDIA’s 70B and a local Phi-3.5 — sharing the same tools. The cloud model called the tools and answered from real data. The local one couldn’t, so it made things up.


🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

A local model classifies every prompt before it leaves the cluster. If it’s sensitive, it’s blocked. If it’s clean, it goes to NVIDIA NIM. 150 lines of FastAPI, deployed on k3s.
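
The classify-then-forward gate can be sketched in a few lines of plain Python. This is a minimal illustration, not the post's FastAPI code: `is_sensitive` is a hypothetical regex stand-in for the local classifier model, and `forward` is a hypothetical callable standing in for the request to NVIDIA NIM.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for the local classifier: flags US-SSN-shaped numbers
# and email-shaped strings. The post uses a local model here, not a regex.
SENSITIVE_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|[\w.+-]+@[\w-]+\.[\w.]+")


def is_sensitive(prompt: str) -> bool:
    """Return True when the prompt appears to contain PII."""
    return SENSITIVE_RE.search(prompt) is not None


def guarded_call(
    prompt: str,
    forward: Callable[[str], str],
    classify: Callable[[str], bool] = is_sensitive,
) -> dict:
    """Block sensitive prompts; forward clean ones to the cloud model."""
    if classify(prompt):
        # The prompt never leaves the process: forward() is not invoked.
        return {"blocked": True, "response": None}
    return {"blocked": False, "response": forward(prompt)}
```

Passing the classifier in as a parameter keeps the gate logic independent of how classification is done, so the regex could be swapped for a call to a local model without touching `guarded_call`.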

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Replace PII with semantically realistic fakes before sending to a cloud LLM, then restore the originals from the response. Started with a general model and prompt engineering — then upgraded to a purpose-built 1.7B fine-tune via Ollama.
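
The replace-then-restore round trip can be sketched as follows. This is an illustration under loud assumptions: detection here is a simple email regex rather than the model the post uses, and the `user1@example.com` style fakes and the function names `anonymize` and `restore` are hypothetical.

```python
import re

# Assumption: only email addresses are treated as PII in this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Swap each distinct email for a fake; return the text and a fake->original map."""
    fake_for: dict[str, str] = {}   # original -> fake, so repeats reuse the same fake
    mapping: dict[str, str] = {}    # fake -> original, used later to restore

    def swap(match: re.Match) -> str:
        original = match.group(0)
        if original not in fake_for:
            fake = f"user{len(fake_for) + 1}@example.com"
            fake_for[original] = fake
            mapping[fake] = original
        return fake_for[original]

    return EMAIL_RE.sub(swap, text), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back wherever the fakes appear in the response."""
    for fake, original in mapping.items():
        text = text.replace(fake, original)
    return text
```

In use, only the output of `anonymize` would be sent to the cloud model; the mapping stays local and is applied to the model's response with `restore`.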

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

llama.cpp’s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.

🤖 Local LLM Inference on Kubernetes, No GPU Required

A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed with a single Git push.