What “operating an LLM” actually means
Running a local model is easy. Understanding what it’s doing is less so.
After deploying llama.cpp + Open WebUI on k3s (previous post), I had a chat interface backed by a local model. What I didn’t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.
The instinct for this kind of problem is usually “add a proxy layer.” There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.
The thing I’d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.
--metrics
One additional argument to the inference server:
args:
- -m
- /models/Phi-3.5-mini-instruct-Q4_K_M.gguf
- --host
- "0.0.0.0"
- --port
- "8080"
- --ctx-size
- "4096"
- --n-predict
- "1024"
- --parallel
- "1"
- --metrics # ← this
- --log-disable
After restart, GET /metrics on port 8080 returns valid Prometheus exposition format:
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
The full set of metrics:
| Metric | Type | What it measures |
|---|---|---|
| llamacpp:prompt_tokens_total | counter | Input tokens processed (cumulative) |
| llamacpp:tokens_predicted_total | counter | Output tokens generated (cumulative) |
| llamacpp:prompt_tokens_seconds | gauge | Current prompt throughput (tok/s) |
| llamacpp:predicted_tokens_seconds | gauge | Current generation throughput (tok/s) |
| llamacpp:tokens_predicted_seconds_total | counter | Total time spent generating |
| llamacpp:prompt_seconds_total | counter | Total time spent on prompts |
| llamacpp:requests_processing | gauge | Requests currently being processed |
| llamacpp:requests_deferred | gauge | Requests queued, waiting for a slot |
| llamacpp:n_decode_total | counter | Total llama_decode() calls |
| llamacpp:n_busy_slots_per_decode | counter | Slots active per decode call |
These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.
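For a sense of how little it takes to read this endpoint outside Prometheus, here is a minimal Python sketch that turns the exposition text into a dictionary. It assumes unlabelled samples, which is what the output above shows. The function name parse_exposition and the SAMPLE constant are illustrative, not part of llama.cpp. Anything with labels or escaped values would need a real parser.

```python
from typing import Dict


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse unlabelled Prometheus samples into {metric_name: value}.

    Assumption: every sample line looks like "<name> <value> [timestamp]",
    with no {label="..."} section. Good enough for the output shown above.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed line, ignore it in this sketch
        name, value = parts[0], parts[1]
        samples[name] = float(value)
    return samples


# Sample text copied from the /metrics output shown earlier.
SAMPLE = """\
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
"""

metrics = parse_exposition(SAMPLE)
for name, value in sorted(metrics.items()):
    print(name, value)
```

The colon in the metric names needs no special handling here, because each line is split on whitespace only.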
Prometheus scrape config
Adding a static scrape target in the existing Prometheus configuration:
extraScrapeConfigs: |
  - job_name: llama-server
    static_configs:
      - targets:
          - llama-server.web-ai-engine.svc:8080
    metrics_path: /metrics
The only non-obvious thing here is the network policy: Prometheus lives in dashboard-homelab, and llama-server lives in web-ai-engine. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In applications.yml:
- namespace: web-ai-engine
  networkPolicies:
    allowIngressFromNamespaces: [dashboard-homelab]
Without this, Prometheus scrape attempts fail silently with a timeout.
Grafana dashboard via ConfigMap
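Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label grafana_dashboard: "1" is picked up, loaded, and made available in Grafana. This works across all namespaces as long as the sidecar's searchNamespace is set to ALL. kube-prometheus-stack sets that out of the box, while the standalone Grafana chart only watches its own namespace unless told otherwise.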
The dashboard ConfigMap lives in web-ai-engine, not dashboard-homelab. The sidecar finds it regardless:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-llm
  namespace: web-ai-engine
  labels:
    grafana_dashboard: "1"
data:
  llm-metrics.json: |
    {
      "title": "LLM Metrics",
      "uid": "llm-metrics",
      ...
    }
Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.
This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app’s Kubernetes resources also describes what the monitoring looks like.
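If the dashboard JSON is maintained as its own file, the wrapping ConfigMap can be generated rather than hand-edited. The following is a minimal Python sketch of that idea, not how the manifests in this post were produced. The helper name build_dashboard_configmap and the empty panels list are assumptions for illustration. The names, namespace, and label are taken from the manifest above.

```python
import json
from typing import Any, Dict


def build_dashboard_configmap(
    name: str, namespace: str, filename: str, dashboard: Dict[str, Any]
) -> Dict[str, Any]:
    """Wrap a Grafana dashboard dict in a ConfigMap the sidecar will pick up."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # The label the Grafana sidecar watches for.
            "labels": {"grafana_dashboard": "1"},
        },
        # ConfigMap values must be strings, so the dashboard is serialised.
        "data": {filename: json.dumps(dashboard, indent=2)},
    }


# Hypothetical, deliberately empty dashboard body; real panels go here.
dashboard = {"title": "LLM Metrics", "uid": "llm-metrics", "panels": []}

manifest = build_dashboard_configmap(
    "grafana-dashboard-llm", "web-ai-engine", "llm-metrics.json", dashboard
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed output can be committed as-is or converted to YAML to match the rest of the repository.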
What the dashboard shows
After sending a few messages through Open WebUI:
Generation throughput: the llamacpp:predicted_tokens_seconds gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you’re comparing models or quantisation levels.
Cumulative tokens: llamacpp:prompt_tokens_total and llamacpp:tokens_predicted_total both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it’s typically 3:1 prompt to generation. Summarisation pushes it further toward prompt tokens, since the input is long and the output short, while long-form generation flips it.
Queue depth: llamacpp:requests_deferred is almost always 0, which is expected with --parallel 1 and a single user. If it’s consistently above 0, you have more concurrent requests than the server can handle with the current slot configuration.
ms/token: derived from rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300 ms/token feels instant; above 400 ms you start noticing.
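To make the arithmetic behind the ms/token ratio concrete, here is a small Python sketch that computes the same quantity from two snapshots of the counters. The metric names are the ones llama.cpp exposes. The function name ms_per_token and the snapshot values are made up for illustration. Unlike PromQL's rate(), this sketch does not correct for counter resets. It returns None if the counters went backwards or no tokens were generated.

```python
from typing import Dict, Optional

SECONDS = "llamacpp:tokens_predicted_seconds_total"
TOKENS = "llamacpp:tokens_predicted_total"


def ms_per_token(
    earlier: Dict[str, float], later: Dict[str, float]
) -> Optional[float]:
    """Average generation latency per token between two counter snapshots."""
    seconds = later[SECONDS] - earlier[SECONDS]
    tokens = later[TOKENS] - earlier[TOKENS]
    # No tokens in the window, or a counter reset (e.g. server restart).
    if tokens <= 0 or seconds < 0:
        return None
    return seconds / tokens * 1000.0


# Hypothetical snapshots: 120 tokens generated in 30 s of generation time.
earlier = {SECONDS: 120.0, TOKENS: 480.0}
later = {SECONDS: 150.0, TOKENS: 600.0}

print(ms_per_token(earlier, later))  # prints 250.0
```

Passing a snapshot of all zeros as the earlier argument gives the lifetime average, which is the total time / total tokens figure mentioned alongside the metrics table.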
What’s missing compared to a proxy layer
LiteLLM and similar proxies give you things this setup doesn’t:
- Per-model routing — if you’re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.
- Virtual API keys — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.
- Spend tracking — meaningful when you’re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.
For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.
The pattern
The broader point is that the observable unit here isn’t the proxy — it’s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it’s the right place to measure.
Starter manifests with the metrics configuration included: homelab-ai-inference-starter
