🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.

I built an AI agent in self-hosted n8n over my kombucha-tracking app, then gave it two brains — NVIDIA’s 70B and a local Phi-3.5 — sharing the same tools. The cloud model called the tools and answered from real data. The local one couldn’t, so it made things up.

🔒 Building a PII Guardrail Proxy for Cloud LLM Calls

A local model classifies every prompt before it leaves the cluster. If it’s sensitive, it’s blocked. If it’s clean, it goes to NVIDIA NIM. 150 lines of FastAPI, deployed on k3s.

🕵️ Privacy-Preserving LLM Pipelines: Anonymize Before You Send

Replace PII with semantically realistic fakes before sending to a cloud LLM, then restore the originals from the response. Started with a general model and prompt engineering — then upgraded to a purpose-built 1.7B fine-tune via Ollama.

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

llama.cpp’s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.

🤖 Local LLM Inference on Kubernetes, No GPU Required

A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed as a single Git push.