<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Prometheus on hippotion</title><link>https://blog.hippotion.com/tags/prometheus/</link><description>Recent content in Prometheus on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 22 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/prometheus/index.xml" rel="self" type="application/rss+xml"/><item><title>Is Anyone Knocking? A Security Pass on My Homelab</title><link>https://blog.hippotion.com/posts/is-anyone-knocking/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/is-anyone-knocking/</guid><description>I set out to answer a simple worry — is someone trying to get into my server? — and found the scarier question underneath it: if they did, would I even know? My front door was solid. The inside had an alarm with the wires cut, a web terminal sitting on the open internet, and no floor under the blast radius. Here&amp;rsquo;s the audit, and the three things I fixed.</description><content:encoded><![CDATA[<h2 id="the-question-i-actually-had">The question I actually had</h2>
<p>It started as a nervous-Sunday kind of question: <em>is a third party trying to
get into my server — over SSH, or some other way?</em> I run a single-node
Kubernetes homelab that hosts a couple dozen little apps, some of them public.
You read about credential-stuffing bots and you start to wonder who&rsquo;s been
rattling the handle while you slept.</p>
<p>So I did the audit. The good news came first, and it&rsquo;s worth saying plainly
because it&rsquo;s the part most homelabs get wrong: <strong>the front door is solid.</strong>
Nothing is reachable from the internet except through a Cloudflare Tunnel —
an outbound-only connection, zero open inbound ports on my router. Almost
every service sits behind OAuth. The cluster has 140 network policies doing
real east-west segmentation. And the login history? Eleven straight weeks
where every single shell login came from one IP — my own workstation on the
LAN. No strangers. No 3 a.m. logins from a VPS in another hemisphere.</p>
<p>I could have stopped there feeling good. That would have been a mistake.</p>
<h2 id="the-scary-finding-wasnt-an-attacker">The scary finding wasn&rsquo;t an attacker</h2>
<p>The useful question turned out not to be <em>&ldquo;is someone knocking?&rdquo;</em> but
<em>&ldquo;if someone got in, would anything tell me?&rdquo;</em> And when I traced that wire,
it ended in the dark.</p>
<p>I have a full monitoring stack — Prometheus, Grafana, Alertmanager, the works.
Alertmanager was running. It was also configured to notify exactly <strong>no one</strong>:
no receivers, and upstream, <strong>no alert rules at all</strong>. It was a smoke detector
with the battery taken out and, for good measure, no smoke sensor either. If an
attacker had walked in, the alarm would have stayed perfectly, silently green.</p>
<p>That reframed the whole job. Three gaps, in priority order.</p>
<h2 id="gap-1--an-alarm-with-no-one-to-call">Gap 1 — an alarm with no one to call</h2>
<p>I built the missing chain end to end. A small exporter on the host parses the
SSH journal and <code>fail2ban</code> state and writes metrics into node_exporter&rsquo;s
textfile collector — so it rides the monitoring I already had instead of adding
a new moving part. On top sit the alert rules that were never there. The one
that matters most is blunt:</p>
<blockquote>
<p><strong>A shell login succeeded from a non-LAN IP.</strong></p>
</blockquote>
<p>That should be impossible in normal life, so if it ever fires, I want it
shouting. It now emails me the instant it happens, alongside quieter alerts for
brute-force spikes, distributed scans, <code>fail2ban</code> going down, and — the
meta-alert I&rsquo;m fondest of — <em>the watchdog itself going stale</em>, because a
security monitor that silently dies is worse than none. And <code>fail2ban</code> now
actually bans the bots, with escalating ban times and my LAN permanently on the
allow-list.</p>
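<p>A minimal sketch of what two of those rules could look like as a Prometheus rule file. The metric name <code>ssh_login_success_non_lan_total</code> and the file name <code>ssh_security.prom</code> are hypothetical stand-ins for whatever the exporter actually writes; <code>node_textfile_mtime_seconds</code> is the timestamp node_exporter itself publishes for each textfile it reads.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">groups:
  - name: ssh-intrusion
    rules:
      # Hypothetical counter written by the host exporter.
      - alert: ShellLoginFromNonLAN
        expr: increase(ssh_login_success_non_lan_total[10m]) &gt; 0
        labels:
          severity: critical
        annotations:
          summary: A shell login succeeded from a non-LAN IP
      # Watchdog: the exporter's textfile has not been rewritten recently.
      - alert: SecurityExporterStale
        expr: time() - node_textfile_mtime_seconds{file=~&#34;.*ssh_security[.]prom&#34;} &gt; 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: SSH security exporter has stopped updating its metrics
</code></pre>
<p>One caveat with <code>increase()</code>: it only sees a change between two samples, so the exporter should write the counter as 0 from the start rather than creating it on the first event.</p>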
<p>The honest lesson: I&rsquo;d been treating &ldquo;I have Prometheus&rdquo; as if it meant &ldquo;I have
monitoring.&rdquo; Dashboards you have to remember to look at are not monitoring.
<strong>Monitoring is the thing that interrupts you.</strong> Until an alert can reach your
phone, you don&rsquo;t have a security alarm — you have a security <em>museum</em>.</p>
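<p>The notification half is a small amount of Alertmanager config. A sketch of a single email receiver, assuming an SMTP relay; every address, hostname, and path below is a placeholder, not my actual setup.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">route:
  receiver: email-me            # default receiver for everything
  group_by: [alertname]
  routes:
    - matchers:
        - severity=&#34;critical&#34;
      receiver: email-me
      group_wait: 0s            # do not batch the loud ones
      repeat_interval: 1h
receivers:
  - name: email-me
    email_configs:
      - to: me@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
        auth_username: alertmanager@example.com
        # Assumes a recent Alertmanager; older releases only accept auth_password.
        auth_password_file: /etc/alertmanager/secrets/smtp-password
        send_resolved: true
</code></pre>
<p>The <code>severity</code> label is what ties a rule to a route, so the rules have to set it for the critical path to apply.</p>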
<h2 id="gap-2--there-was-a-web-terminal-on-the-open-internet">Gap 2 — there was a web terminal on the open internet</h2>
<p>This is the one that made me wince. Among my public hostnames was <code>ttyd</code> — a
browser-based shell. A full terminal on my server, reachable from anywhere,
sitting behind a single OAuth proxy. One misconfiguration, one OAuth bypass,
and that&rsquo;s not &ldquo;an app is compromised,&rdquo; that&rsquo;s <em>root on the box from a browser
tab.</em></p>
<p>The fix here isn&rsquo;t more locks. It&rsquo;s the realization that <strong>the strongest
control is not exposing the thing at all.</strong> I deleted the web terminal
entirely — app, manifests, dashboard tile, all of it. Then I went down the
public hostname list and pulled everything with no business being public off
the tunnel: the secrets UI, the ingress dashboard, Prometheus, Alertmanager,
the network-observability console, the DNS admin. They still work — on my LAN,
over the same wildcard cert — they&rsquo;re just not the internet&rsquo;s business anymore.
A service that isn&rsquo;t exposed has no internet-facing attack surface to harden.</p>
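<p>For anyone running <code>cloudflared</code> from a config file, un-exposing a service comes down to deleting its ingress rule. A sketch under that assumption; the tunnel name, hostname, and backend service are placeholders.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">tunnel: homelab                        # placeholder tunnel name or UUID
credentials-file: /etc/cloudflared/credentials.json
ingress:
  # Only hostnames that are meant to be public get a rule.
  - hostname: blog.example.com
    service: http://ingress-controller.ingress.svc:80
  # No rules for the web terminal, Prometheus, Alertmanager, or the secrets UI.
  # Anything unmatched falls through to the mandatory catch-all.
  - service: http_status:404
</code></pre>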
<h2 id="gap-3--no-floor-under-the-blast-radius">Gap 3 — no floor under the blast radius</h2>
<p>The network policies limit how far a compromised pod can talk sideways. But
nothing stopped a workload from running as root, mounting the host filesystem,
or grabbing the host network in the first place. So I turned on Kubernetes'
built-in Pod Security Admission: every namespace now at least <em>reports</em>
baseline violations, and the clean app namespaces <em>enforce</em> baseline —
meaning a compromised app there simply cannot request privileged mode or a
hostPath mount. It&rsquo;s a floor. Floors are underrated.</p>
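<p>Pod Security Admission is driven entirely by namespace labels. A sketch for one enforcing namespace; the namespace name is hypothetical.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: Namespace
metadata:
  name: example-app                    # hypothetical app namespace
  labels:
    # Reject pods that violate the baseline profile.
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    # Report-only modes: a warning to the client and an audit log annotation.
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/audit: baseline
</code></pre>
<p>A namespace that only reports would carry the <code>warn</code> and <code>audit</code> labels without <code>enforce</code>. Worth knowing: baseline blocks privileged containers, host namespaces, and hostPath volumes, but it does not stop a container from running as root; that takes the <code>restricted</code> profile.</p>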
<h2 id="what-the-audit-was-really-about">What the audit was really about</h2>
<p>I went looking for an intruder and didn&rsquo;t find one — the logs were clean, the
front door held. What I found instead was that I&rsquo;d built something secure at
the perimeter and then never asked the uncomfortable follow-up: <em>what happens
after the perimeter?</em> The answer had been &ldquo;nothing happens, and no one is
told,&rdquo; and I just hadn&rsquo;t looked.</p>
<p>Three principles I&rsquo;m taking with me:</p>
<ul>
<li><strong>An alarm that can&rsquo;t reach you is decoration.</strong> Wire the notification first;
the rules are easy once something is listening.</li>
<li><strong>&ldquo;Don&rsquo;t expose it&rdquo; beats &ldquo;add more auth.&rdquo;</strong> Every hostname you take off the
public internet is a class of attack you no longer have to be clever about.</li>
<li><strong>Give the blast radius a floor.</strong> Assume one thing gets popped, and decide
in advance how far it gets.</li>
</ul>
<p>The best part: all of it is GitOps. The intrusion alerts, the un-exposing, the
pod-security floor — every change is a commit, reviewable and revertible, and
my cluster reconciles itself to match. The audit didn&rsquo;t just make the homelab
safer. It wrote down <em>why</em> it&rsquo;s safer, in a form the next version of me can
read.</p>
<p>Now if someone knocks, I&rsquo;ll know. And the web terminal isn&rsquo;t answering the
door anymore — because it&rsquo;s gone.</p>
]]></content:encoded></item><item><title>📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics</title><link>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</guid><description>llama.cpp&amp;rsquo;s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.</description><content:encoded><![CDATA[<h2 id="what-operating-an-llm-actually-means">What &ldquo;operating an LLM&rdquo; actually means</h2>
<p>Running a local model is easy. Understanding what it&rsquo;s doing is less so.</p>
<p>After deploying llama.cpp + Open WebUI on k3s (<a href="/posts/local-llm-k8s-no-gpu/">previous post</a>), I had a chat interface backed by a local model. What I didn&rsquo;t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.</p>
<p>The instinct for this kind of problem is usually &ldquo;add a proxy layer.&rdquo; There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.</p>
<p>The thing I&rsquo;d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.</p>
<hr>
<h2 id="--metrics"><code>--metrics</code></h2>
<p>One additional argument to the inference server:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- -<span class="l">m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">/models/Phi-3.5-mini-instruct-Q4_K_M.gguf</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">host</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;0.0.0.0&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">port</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;8080&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">ctx-size</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;4096&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="kc">n</span>-<span class="l">predict</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1024&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">parallel</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">metrics       </span><span class="w"> </span><span class="c"># ← this</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">log-disable</span><span class="w">
</span></span></span></code></pre></div><p>After restart, <code>GET /metrics</code> on port 8080 returns valid Prometheus exposition format:</p>
<pre tabindex="0"><code># HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
</code></pre><p>The full set of metrics:</p>
<table>
	<thead>
			<tr>
					<th>Metric</th>
					<th>Type</th>
					<th>What it measures</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>llamacpp:prompt_tokens_total</code></td>
					<td>counter</td>
					<td>Input tokens processed (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_total</code></td>
					<td>counter</td>
					<td>Output tokens generated (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Average prompt throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:predicted_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Average generation throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent generating</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent on prompts</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_processing</code></td>
					<td>gauge</td>
					<td>Requests currently being processed</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_deferred</code></td>
					<td>gauge</td>
					<td>Requests queued, waiting for a slot</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_decode_total</code></td>
					<td>counter</td>
					<td>Total llama_decode() calls</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_busy_slots_per_decode</code></td>
					<td>counter</td>
					<td>Average number of busy slots per decode call</td>
			</tr>
	</tbody>
</table>
<p>These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.</p>
<hr>
<h2 id="prometheus-scrape-config">Prometheus scrape config</h2>
<p>Adding a static scrape target in the existing Prometheus configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">extraScrapeConfigs</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">  - job_name: llama-server
</span></span></span><span class="line"><span class="cl"><span class="sd">    static_configs:
</span></span></span><span class="line"><span class="cl"><span class="sd">      - targets:
</span></span></span><span class="line"><span class="cl"><span class="sd">          - llama-server.web-ai-engine.svc:8080
</span></span></span><span class="line"><span class="cl"><span class="sd">    metrics_path: /metrics</span><span class="w">
</span></span></span></code></pre></div><p>The only non-obvious thing here is the network policy: Prometheus lives in <code>dashboard-homelab</code>, and llama-server lives in <code>web-ai-engine</code>. With Cilium network policies enforcing namespace isolation, the AI engine namespace has to allow ingress from the dashboard namespace before Prometheus can reach the pod. In <code>applications.yml</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">allowIngressFromNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">dashboard-homelab]</span><span class="w">
</span></span></span></code></pre></div><p>Without this, every scrape times out and the target simply shows as down in Prometheus (<code>up == 0</code>); nothing on the llama-server side reports an error.</p>
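<p><code>applications.yml</code> is my own shorthand, so here is roughly what that one line stands for, written as a plain Kubernetes NetworkPolicy (which Cilium also enforces). This is a sketch of the equivalent rule, not the manifest my tooling actually generates; the policy name is made up.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape        # hypothetical name
  namespace: web-ai-engine
spec:
  podSelector: {}                      # every pod in web-ai-engine
  policyTypes: [Ingress]
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on every namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: dashboard-homelab
      ports:
        - protocol: TCP
          port: 8080
</code></pre>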
<hr>
<h2 id="grafana-dashboard-via-configmap">Grafana dashboard via ConfigMap</h2>
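<p>Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label <code>grafana_dashboard: &quot;1&quot;</code> is picked up, loaded, and made available in Grafana. It does this across all namespaces as long as the sidecar is set to search them (<code>searchNamespace: ALL</code> in the Grafana Helm chart, which some bundled charts set by default).</p>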
<p>The dashboard ConfigMap lives in <code>web-ai-engine</code>, not <code>dashboard-homelab</code>. The sidecar finds it regardless:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ConfigMap</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">grafana-dashboard-llm</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">grafana_dashboard</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">llm-metrics.json</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">    {
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;title&#34;: &#34;LLM Metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;uid&#34;: &#34;llm-metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      ...
</span></span></span><span class="line"><span class="cl"><span class="sd">    }</span><span class="w">
</span></span></span></code></pre></div><p>Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.</p>
<p>This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app&rsquo;s Kubernetes resources also describes what the monitoring looks like.</p>
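<p>For reference, these are the sidecar settings that make the cross-namespace pickup work. A sketch assuming the Grafana Helm chart; if Grafana comes from a bundled chart, the same keys usually sit one level down under a <code>grafana:</code> block.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard           # label key the sidecar watches for
    labelValue: &#34;1&#34;
    searchNamespace: ALL               # without this, only Grafana's own namespace is searched
</code></pre>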
<hr>
<h2 id="what-the-dashboard-shows">What the dashboard shows</h2>
<p>After sending a few messages through Open WebUI:</p>
<p><strong>Generation throughput</strong> — the <code>llamacpp:predicted_tokens_seconds</code> gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you&rsquo;re comparing models or quantisation levels.</p>
<p><strong>Cumulative tokens</strong> — <code>llamacpp:prompt_tokens_total</code> and <code>llamacpp:tokens_predicted_total</code> both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it&rsquo;s typically around 3:1 prompt to generation; for summarisation, where a long document produces a short answer, it skews even further toward prompt tokens.</p>
<p><strong>Queue depth</strong> — <code>llamacpp:requests_deferred</code> is 0 almost always, which is expected for a single user: with <code>--parallel 1</code> there is one slot, and a request is only deferred when it arrives while another is still generating. If it&rsquo;s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.</p>
<p><strong>ms/token</strong> — derived from <code>rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000</code>. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.</p>
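<p>The same expression can live in a Prometheus rule file, so the dashboard panel and any alert share one definition. A sketch; the recorded series name, alert name, and thresholds are my own illustrative choices.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">groups:
  - name: llama-server
    rules:
      # Per-token latency in milliseconds over the last 5 minutes.
      # Evaluates to NaN while the server is idle (0 / 0), so panels show a gap.
      - record: llamacpp:ms_per_token:rate5m
        expr: |
          rate(llamacpp:tokens_predicted_seconds_total[5m])
            / rate(llamacpp:tokens_predicted_total[5m]) * 1000
      # Requests have been waiting for a slot for a sustained period.
      - alert: LlamaRequestsQueued
        expr: llamacpp:requests_deferred &gt; 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: llama-server has had deferred requests for 10 minutes
</code></pre>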
<hr>
<h2 id="whats-missing-compared-to-a-proxy-layer">What&rsquo;s missing compared to a proxy layer</h2>
<p>LiteLLM and similar proxies give you things this setup doesn&rsquo;t:</p>
<ul>
<li><strong>Per-model routing</strong> — if you&rsquo;re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.</li>
<li><strong>Virtual API keys</strong> — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.</li>
<li><strong>Spend tracking</strong> — meaningful when you&rsquo;re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.</li>
</ul>
<p>For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.</p>
<hr>
<h2 id="the-pattern">The pattern</h2>
<p>The broader point is that the observable unit here isn&rsquo;t the proxy — it&rsquo;s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it&rsquo;s the right place to measure.</p>
<p>Starter manifests with the metrics configuration included: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
]]></content:encoded></item></channel></rss>