Prometheus on hippotion

Is Anyone Knocking? A Security Pass on My Homelab

Fri, 22 May 2026 00:00:00 +0000

The question I actually had

It started as a nervous-Sunday kind of question: is a third party trying to get into my server — over SSH, or some other way? I run a single-node Kubernetes homelab that hosts a couple dozen little apps, some of them public. You read about credential-stuffing bots and you start to wonder who’s been rattling the handle while you slept.

So I did the audit. The good news came first, and it’s worth saying plainly because it’s the part most homelabs get wrong: the front door is solid. Nothing is reachable from the internet except through a Cloudflare Tunnel — an outbound-only connection, zero open inbound ports on my router. Almost every service sits behind OAuth. The cluster has 140 network policies doing real east-west segmentation. And the login history? Eleven straight weeks where every single shell login came from one IP — my own workstation on the LAN. No strangers. No 3 a.m. logins from a VPS in another hemisphere.

I could have stopped there feeling good. That would have been a mistake.

The scary finding wasn’t an attacker

The useful question turned out not to be “is someone knocking?” but “if someone got in, would anything tell me?” And when I traced that wire, it ended in the dark.

I have a full monitoring stack — Prometheus, Grafana, Alertmanager, the works. Alertmanager was running. It was also configured to notify exactly no one: no receivers, and upstream, no alert rules at all. It was a smoke detector with the battery taken out and, for good measure, no smoke sensor either. If an attacker had walked in, the alarm would have stayed perfectly, silently green.

That reframed the whole job. Three gaps, in priority order.

Gap 1 — an alarm with no one to call

I built the missing chain end to end. A small exporter on the host parses the SSH journal and fail2ban state and writes metrics into node_exporter’s textfile collector — so it rides the monitoring I already had instead of adding a new moving part. On top sit the alert rules that were never there. The one that matters most is blunt:

A shell login succeeded from a non-LAN IP.

That should be impossible in normal life, so if it ever fires, I want it shouting. It now emails me the instant it happens, alongside quieter alerts for brute-force spikes, distributed scans, fail2ban going down, and — the meta-alert I’m fondest of — the watchdog itself going stale, because a security monitor that silently dies is worse than none. And fail2ban now actually bans the bots, with escalating ban times and my LAN permanently on the allow-list.
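
As a rough illustration, the two alerts described above could look something like the rule file below. This is a sketch, not the rules actually running on the cluster: the counter name ssh_login_success_nonlan_total and the file name ssh_security.prom are hypothetical stand-ins for whatever the exporter really writes. The node_textfile_mtime_seconds metric is the one node_exporter itself publishes for each textfile it reads, which makes it a convenient staleness signal.

```yaml
# Prometheus rule file (sketch). Names marked "hypothetical" are assumptions.
groups:
  - name: ssh-intrusion
    rules:
      - alert: ShellLoginFromNonLanIP
        # Hypothetical counter written by the host exporter. The exporter must
        # publish it as 0 from the start, otherwise increase() has no earlier
        # sample to compare against when the first login appears.
        expr: increase(ssh_login_success_nonlan_total[5m]) > 0
        for: 0m              # fire on the first evaluation, no grace period
        labels:
          severity: critical
        annotations:
          summary: "Shell login succeeded from a non-LAN IP on {{ $labels.instance }}"

      - alert: SecurityExporterStale
        # node_exporter exposes the modification time of every textfile it
        # reads; the file name here is hypothetical.
        expr: time() - node_textfile_mtime_seconds{file=~".*ssh_security.prom"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SSH security metrics have not been refreshed for over 10 minutes"
```

The staleness rule leans on file modification time rather than a metric the exporter writes about itself, so it still fires if the exporter crashes before producing any output.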

The honest lesson: I’d been treating “I have Prometheus” as if it meant “I have monitoring.” Dashboards you have to remember to look at are not monitoring. Monitoring is the thing that interrupts you. Until an alert can reach your phone, you don’t have a security alarm — you have a security museum.
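
For reference, the "someone to call" part is a small amount of Alertmanager configuration. The sketch below assumes plain SMTP email; every address, the mail host, and the secret path are placeholders, and auth_password_file needs a reasonably recent Alertmanager (older releases only accept an inline auth_password).

```yaml
# alertmanager.yml (sketch). All addresses, hosts, and paths are placeholders.
route:
  receiver: email-me          # default receiver for every alert
  group_by: [alertname]
  routes:
    - matchers:
        - severity="critical"
      receiver: email-me
      group_wait: 0s          # send critical alerts without batching delay
      repeat_interval: 1h

receivers:
  - name: email-me
    email_configs:
      - to: me@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
        auth_username: alertmanager@example.com
        auth_password_file: /etc/alertmanager/secrets/smtp-password
        send_resolved: true
```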

Gap 2 — there was a web terminal on the open internet

This is the one that made me wince. Among my public hostnames was ttyd — a browser-based shell. A full terminal on my server, reachable from anywhere, sitting behind a single OAuth proxy. One misconfiguration, one OAuth bypass, and that’s not “an app is compromised,” that’s root on the box from a browser tab.

The fix here isn’t more locks. It’s the realization that the strongest control is not exposing the thing at all. I deleted the web terminal entirely: app, manifests, dashboard tile, all of it. Then I went down the public hostname list and pulled everything with no business being public off the tunnel: the secrets UI, the ingress dashboard, Prometheus, Alertmanager, the network-observability console, the DNS admin. They still work on my LAN, over the same wildcard cert; they’re just not the internet’s business anymore. A service that isn’t exposed to the internet has no internet-facing attack surface to harden.

Gap 3 — no floor under the blast radius

The network policies limit how far a compromised pod can talk sideways. But nothing stopped a workload from running as root, mounting the host filesystem, or grabbing the host network in the first place. So I turned on Kubernetes' built-in Pod Security Admission: every namespace now at least reports baseline violations, and the clean app namespaces enforce baseline — meaning a compromised app there simply cannot request privileged mode or a hostPath mount. It’s a floor. Floors are underrated.
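
Pod Security Admission is driven entirely by namespace labels, so the two tiers described above come down to a few lines each. A minimal sketch, with hypothetical namespace names; the pod-security.kubernetes.io labels themselves are the standard Kubernetes ones.

```yaml
# Report-only tier: violations are surfaced but pods are still admitted.
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-app            # hypothetical namespace
  labels:
    pod-security.kubernetes.io/warn: baseline    # warning returned to the client
    pod-security.kubernetes.io/audit: baseline   # annotation in the audit log
---
# Enforcing tier: pods that violate baseline are rejected at admission.
apiVersion: v1
kind: Namespace
metadata:
  name: clean-app             # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/audit: baseline
```

Starting every namespace in the report-only tier shows which workloads would break before enforce is switched on for any of them.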

What the audit was really about

I went looking for an intruder and didn’t find one — the logs were clean, the front door held. What I found instead was that I’d built something secure at the perimeter and then never asked the uncomfortable follow-up: what happens after the perimeter? The answer had been “nothing happens, and no one is told,” and I just hadn’t looked.

Three principles I’m taking with me:

An alarm that can’t reach you is decoration. Wire the notification first; the rules are easy once something is listening.
“Don’t expose it” beats “add more auth.” Every hostname you take off the public internet is a class of attack you no longer have to be clever about.
Give the blast radius a floor. Assume one thing gets popped, and decide in advance how far it gets.

The best part: all of it is GitOps. The intrusion alerts, the un-exposing, the pod-security floor — every change is a commit, reviewable and revertible, and my cluster reconciles itself to match. The audit didn’t just make the homelab safer. It wrote down why it’s safer, in a form the next version of me can read.

Now if someone knocks, I’ll know. And the web terminal isn’t answering the door anymore — because it’s gone.

📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics

Fri, 29 Aug 2025 00:00:00 +0000

What “operating an LLM” actually means

Running a local model is easy. Understanding what it’s doing is less so.

After deploying llama.cpp + Open WebUI on k3s (previous post), I had a chat interface backed by a local model. What I didn’t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.

The instinct for this kind of problem is usually “add a proxy layer.” There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.

The thing I’d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.

`--metrics`

One additional argument to the inference server:

args:
  - -m
  - /models/Phi-3.5-mini-instruct-Q4_K_M.gguf
  - --host
  - "0.0.0.0"
  - --port
  - "8080"
  - --ctx-size
  - "4096"
  - --n-predict
  - "1024"
  - --parallel
  - "1"
  - --metrics        # ← this
  - --log-disable

After restart, GET /metrics on port 8080 returns valid Prometheus exposition format:

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0

The full set of metrics:

Metric	Type	What it measures
`llamacpp:prompt_tokens_total`	counter	Input tokens processed (cumulative)
`llamacpp:tokens_predicted_total`	counter	Output tokens generated (cumulative)
`llamacpp:prompt_tokens_seconds`	gauge	Average prompt throughput (tok/s)
`llamacpp:predicted_tokens_seconds`	gauge	Average generation throughput (tok/s)
`llamacpp:tokens_predicted_seconds_total`	counter	Total time spent generating
`llamacpp:prompt_seconds_total`	counter	Total time spent on prompts
`llamacpp:requests_processing`	gauge	Requests currently being processed
`llamacpp:requests_deferred`	gauge	Requests queued, waiting for a slot
`llamacpp:n_decode_total`	counter	Total llama_decode() calls
`llamacpp:n_busy_slots_per_decode`	counter	Average busy slots per llama_decode() call

These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.

Prometheus scrape config

Adding a static scrape target in the existing Prometheus configuration:

extraScrapeConfigs: |
  - job_name: llama-server
    static_configs:
      - targets:
          - llama-server.web-ai-engine.svc:8080
    metrics_path: /metrics

The only non-obvious thing here is the network policy: Prometheus lives in dashboard-homelab, and llama-server lives in web-ai-engine. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In applications.yml:

- namespace: web-ai-engine
  networkPolicies:
    allowIngressFromNamespaces: [dashboard-homelab]

Without this, Prometheus scrape attempts fail silently with a timeout.
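
The applications.yml entry is this repo's own abstraction, so the object it renders isn't shown here. As an assumption about what the equivalent looks like, a plain Kubernetes NetworkPolicy (which Cilium enforces) allowing the scrape could be sketched like this; the policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape     # hypothetical name
  namespace: web-ai-engine
spec:
  podSelector: {}                   # applies to every pod in web-ai-engine
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label that Kubernetes sets automatically on every namespace.
              kubernetes.io/metadata.name: dashboard-homelab
      ports:
        - protocol: TCP
          port: 8080                # the llama-server port from the scrape target
```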

Grafana dashboard via ConfigMap

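Rather than importing a dashboard JSON manually through the Grafana UI, I let the Grafana sidecar handle it automatically. Any ConfigMap with the label grafana_dashboard: "1" is picked up, loaded, and made available in Grafana. With the sidecar's searchNamespace set to ALL (the default in kube-prometheus-stack; the standalone Grafana chart watches only its own namespace unless configured otherwise), that works across all namespaces.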

The dashboard ConfigMap lives in web-ai-engine, not dashboard-homelab. The sidecar finds it regardless:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-llm
  namespace: web-ai-engine
  labels:
    grafana_dashboard: "1"
data:
  llm-metrics.json: |
    {
      "title": "LLM Metrics",
      "uid": "llm-metrics",
      ...
    }

Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.

This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app’s Kubernetes resources also describes what the monitoring looks like.
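
For completeness, the sidecar behaviour relied on above is controlled by a handful of Helm values. This sketch assumes Grafana is installed from the standalone Grafana chart; under kube-prometheus-stack the same keys sit beneath a top-level grafana: key.

```yaml
# Grafana Helm values (sketch, assuming the standalone Grafana chart).
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard    # ConfigMap label key the sidecar watches for
    labelValue: "1"             # only ConfigMaps with this value are loaded
    searchNamespace: ALL        # watch every namespace, not just Grafana's own
```

Watching all namespaces means the sidecar's service account needs cluster-wide read access to ConfigMaps, which the chart's RBAC settings normally provide when this option is set.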

What the dashboard shows

After sending a few messages through Open WebUI:

Generation throughput — the llamacpp:predicted_tokens_seconds gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you’re comparing models or quantisation levels.

Cumulative tokens — llamacpp:prompt_tokens_total and llamacpp:tokens_predicted_total both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it’s typically 3:1 prompt to generation; for summarisation tasks it flips.

Queue depth — llamacpp:requests_deferred is 0 almost always, which is expected with --parallel 1. If it’s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.

ms/token — derived from rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.
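
As a sanity check on the numbers: at 4 tok/s, one token takes 1000 / 4 = 250 ms, so the 3–5 tok/s range above corresponds to roughly 333–200 ms/token. If the expression is used in more than one panel, it can be precomputed as a recording rule. The sketch below assumes the rule name, which is hypothetical; the two counters are the ones listed in the metrics table.

```yaml
# Prometheus recording rule (sketch). The record name is hypothetical.
groups:
  - name: llamacpp-derived
    rules:
      - record: llamacpp:ms_per_token:rate5m
        # Seconds spent generating divided by tokens generated, over the last
        # 5 minutes, converted to milliseconds. While the server is idle both
        # rates are 0, so the result is NaN and the panel shows a gap.
        expr: |
          1000 * rate(llamacpp:tokens_predicted_seconds_total[5m])
               / rate(llamacpp:tokens_predicted_total[5m])
```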

What’s missing compared to a proxy layer

LiteLLM and similar proxies give you things this setup doesn’t:

Per-model routing — if you’re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.
Virtual API keys — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.
Spend tracking — meaningful when you’re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.

For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.

The pattern

The broader point is that the observable unit here isn’t the proxy — it’s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it’s the right place to measure.

Starter manifests with the metrics configuration included: homelab-ai-inference-starter