Llama.cpp on hippotion

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

Fri, 29 Aug 2025 00:00:00 +0000

What “operating an LLM” actually means

Running a local model is easy. Understanding what it’s doing is less so.

After deploying llama.cpp + Open WebUI on k3s (previous post), I had a chat interface backed by a local model. What I didn’t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.

The instinct for this kind of problem is usually “add a proxy layer.” There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.

The thing I’d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.

`--metrics`

One additional argument to the inference server:

args:
  - -m
  - /models/Phi-3.5-mini-instruct-Q4_K_M.gguf
  - --host
  - "0.0.0.0"
  - --port
  - "8080"
  - --ctx-size
  - "4096"
  - --n-predict
  - "1024"
  - --parallel
  - "1"
  - --metrics        # ← this
  - --log-disable

After restart, GET /metrics on port 8080 returns valid Prometheus exposition format:

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0

The full set of metrics:

Metric	Type	What it measures
`llamacpp:prompt_tokens_total`	counter	Input tokens processed (cumulative)
`llamacpp:tokens_predicted_total`	counter	Output tokens generated (cumulative)
`llamacpp:prompt_tokens_seconds`	gauge	Current prompt throughput (tok/s)
`llamacpp:predicted_tokens_seconds`	gauge	Current generation throughput (tok/s)
`llamacpp:tokens_predicted_seconds_total`	counter	Total time spent generating
`llamacpp:prompt_seconds_total`	counter	Total time spent on prompts
`llamacpp:requests_processing`	gauge	Requests currently being processed
`llamacpp:requests_deferred`	gauge	Requests queued, waiting for a slot
`llamacpp:n_decode_total`	counter	Total llama_decode() calls
`llamacpp:n_busy_slots_per_decode`	counter	Slots active per decode call

These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.

Prometheus scrape config

Adding a static scrape target in the existing Prometheus configuration:

extraScrapeConfigs: |
  - job_name: llama-server
    static_configs:
      - targets:
          - llama-server.web-ai-engine.svc:8080
    metrics_path: /metrics

The only non-obvious thing here is the network policy: Prometheus lives in dashboard-homelab, and llama-server lives in web-ai-engine. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In applications.yml:

- namespace: web-ai-engine
  networkPolicies:
    allowIngressFromNamespaces: [dashboard-homelab]

Without this, Prometheus scrape attempts fail silently with a timeout.

Grafana dashboard via ConfigMap

Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label grafana_dashboard: "1" is picked up, loaded, and available in Grafana — across all namespaces by default.

The dashboard ConfigMap lives in web-ai-engine, not dashboard-homelab. The sidecar finds it regardless:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-llm
  namespace: web-ai-engine
  labels:
    grafana_dashboard: "1"
data:
  llm-metrics.json: |
    {
      "title": "LLM Metrics",
      "uid": "llm-metrics",
      ...
    }

Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.

This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app’s Kubernetes resources also describes what the monitoring looks like.

What the dashboard shows

After sending a few messages through Open WebUI:

Generation throughput — the llamacpp:predicted_tokens_seconds gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you’re comparing models or quantisation levels.

Cumulative tokens — llamacpp:prompt_tokens_total and llamacpp:tokens_predicted_total both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it’s typically 3:1 prompt to generation; for summarisation tasks it flips.

Queue depth — llamacpp:requests_deferred is 0 almost always, which is expected with --parallel 1. If it’s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.

ms/token — derived from rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.

What’s missing compared to a proxy layer

LiteLLM and similar proxies give you things this setup doesn’t:

Per-model routing — if you’re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.
Virtual API keys — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.
Spend tracking — meaningful when you’re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.

For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.

The pattern

The broader point is that the observable unit here isn’t the proxy — it’s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it’s the right place to measure.

Starter manifests with the metrics configuration included: homelab-ai-inference-starter

🤖 Local LLM Inference on Kubernetes, No GPU Required

Fri, 15 Aug 2025 00:00:00 +0000

The GPU assumption

Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you’re wasting your time — inference will be too slow to be useful.

That’s not been my experience.

I’ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.

The point isn’t that CPU inference beats GPU inference. It’s that “good enough for personal use” is a much lower bar than “production LLM serving”, and the hardware you already have probably clears it.

The stack

Two components:

llama.cpp (ghcr.io/ggml-org/llama.cpp:server) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.

Open WebUI (ghcr.io/open-webui/open-webui) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.

The architecture is simple on purpose:

Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models

Open WebUI doesn’t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn’t change. The interface is the standard.

Model choice

GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:

Model	Quant	Size	RAM at runtime	Notes
Llama-3.2-3B	Q4_K_M	~2 GB	~3 GB	Fastest, lowest quality
Phi-3.5-mini	Q4_K_M	~2.4 GB	~3–4 GB	Good balance — what I use
Mistral-7B-Instruct	Q4_K_M	~4.1 GB	~5–6 GB	Noticeably better, needs more RAM
Llama-3.1-8B	Q4_K_M	~4.7 GB	~6–8 GB	High quality, stretches 16 GB with other workloads

On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it’s tighter.

Models live in /srv/ai-models on the node, mounted into the pod as a hostPath volume. Single-node homelab, so there’s no scheduling concern. Download once with wget, done.

Key configuration choices

Context size (--ctx-size 4096): How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you’re summarising long documents, bump to 8192 and watch your RAM usage.

Max output tokens (--n-predict 1024): Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.

Parallel slots (--parallel 1): How many concurrent inference requests the server handles. On CPU there’s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.

Memory limits: Set the container limit to roughly 2× the model’s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 6Gi

No CPU limit. llama-server will use however many cores are available during inference — that’s what makes it usable. A CPU limit would throttle inference to unusable speeds.

Deployment as a GitOps push

The whole stack lives in one YAML values file, deployed through the extra-objects chart that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.

Nothing was kubectl apply-ed. The deployment happened by pushing to Git.

What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no kubectl. The same process I use for any other service in the cluster.

The namespace, network policies, service account, and RBAC all generate from a single entry in applications.yml — same as every other app. The AI inference stack isn’t special from an operations perspective.

# applications.yml excerpt
- namespace: web-ai-engine
  applications:
    - applicationCode: web-ai-engine
      path: helm-charts/extra-objects
      autoSync: true

Access and auth

The service is exposed at ai.hippotion.com through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let’s Encrypt cert. See that post for the full explanation.

Auth is handled by Traefik’s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI’s own auth is disabled (WEBUI_AUTH: false) — the OAuth layer upstream handles it. One login covers every service in the cluster.

The WEBUI_SECRET_KEY (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.

What the day-to-day is actually like

Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you’re reading what came before while it generates, that’s fine. For quick factual lookups, it’s a little tedious.

The useful cases for a local model, for me:

Rephrasing or editing text — paste something, ask it to tighten it. No data leaves the house.
Config explanation — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.
Quick summaries — short documents, log snippets, error messages.
Experimentation — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.

For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I’m iterating and don’t want to pay per token.

The starting point if you want to try it

The manifests are on GitHub: homelab-ai-inference-starter

It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.

No GPU required. The ThinkCentre in the corner of my desk does the job.