The GPU assumption
Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you’re wasting your time — inference will be too slow to be useful.
That’s not been my experience.
I’ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.
The point isn’t that CPU inference beats GPU inference. It’s that “good enough for personal use” is a much lower bar than “production LLM serving”, and the hardware you already have probably clears it.
The stack
Two components:
llama.cpp (ghcr.io/ggml-org/llama.cpp:server) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.
Open WebUI (ghcr.io/open-webui/open-webui) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.
The architecture is simple on purpose:
Browser → Open WebUI (:80)
│
│ OpenAI-compatible API
▼
llama-server (:8080)
│
│ reads GGUF model file
▼
hostPath /srv/ai-models
Open WebUI doesn’t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn’t change. The interface is the standard.
Model choice
GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:
| Model | Quant | Size | RAM at runtime | Notes |
|---|---|---|---|---|
| Llama-3.2-3B | Q4_K_M | ~2 GB | ~3 GB | Fastest, lowest quality |
| Phi-3.5-mini | Q4_K_M | ~2.4 GB | ~3–4 GB | Good balance — what I use |
| Mistral-7B-Instruct | Q4_K_M | ~4.1 GB | ~5–6 GB | Noticeably better, needs more RAM |
| Llama-3.1-8B | Q4_K_M | ~4.7 GB | ~6–8 GB | High quality, stretches 16 GB with other workloads |
On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it’s tighter.
Models live in /srv/ai-models on the node, mounted into the pod as a hostPath volume. Single-node homelab, so there’s no scheduling concern. Download once with wget, done.
Key configuration choices
Context size (--ctx-size 4096): How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you’re summarising long documents, bump to 8192 and watch your RAM usage.
Max output tokens (--n-predict 1024): Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.
Parallel slots (--parallel 1): How many concurrent inference requests the server handles. On CPU there’s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.
Memory limits: Set the container limit to roughly 2× the model’s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
memory: 6Gi
No CPU limit. llama-server will use however many cores are available during inference — that’s what makes it usable. A CPU limit would throttle inference to unusable speeds.
Deployment as a GitOps push
The whole stack lives in one YAML values file, deployed through the extra-objects chart that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.
Nothing was kubectl apply-ed. The deployment happened by pushing to Git.
What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no kubectl. The same process I use for any other service in the cluster.
The namespace, network policies, service account, and RBAC all generate from a single entry in applications.yml — same as every other app. The AI inference stack isn’t special from an operations perspective.
# applications.yml excerpt
- namespace: web-ai-engine
applications:
- applicationCode: web-ai-engine
path: helm-charts/extra-objects
autoSync: true
Access and auth
The service is exposed at ai.hippotion.com through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let’s Encrypt cert. See that post for the full explanation.
Auth is handled by Traefik’s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI’s own auth is disabled (WEBUI_AUTH: false) — the OAuth layer upstream handles it. One login covers every service in the cluster.
The WEBUI_SECRET_KEY (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.
What the day-to-day is actually like
Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you’re reading what came before while it generates, that’s fine. For quick factual lookups, it’s a little tedious.
The useful cases for a local model, for me:
- Rephrasing or editing text — paste something, ask it to tighten it. No data leaves the house.
- Config explanation — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.
- Quick summaries — short documents, log snippets, error messages.
- Experimentation — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.
For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I’m iterating and don’t want to pay per token.
The starting point if you want to try it
The manifests are on GitHub: homelab-ai-inference-starter
It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.
No GPU required. The ThinkCentre in the corner of my desk does the job.
