<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Llama.cpp on hippotion</title><link>https://blog.hippotion.com/tags/llama.cpp/</link><description>Recent content in Llama.cpp on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 29 Aug 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/llama.cpp/index.xml" rel="self" type="application/rss+xml"/><item><title>📈 Observing Local LLM Inference: llama.cpp's Built-in Prometheus Metrics</title><link>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/llm-observability-llamacpp-prometheus/</guid><description>llama.cpp&amp;rsquo;s inference server ships a /metrics endpoint. One flag, Prometheus scraping, a Grafana dashboard loaded via ConfigMap sidecar — AI observability without a proxy layer.</description><content:encoded><![CDATA[<h2 id="what-operating-an-llm-actually-means">What &ldquo;operating an LLM&rdquo; actually means</h2>
<p>Running a local model is easy. Understanding what it&rsquo;s doing is less so.</p>
<p>After deploying llama.cpp + Open WebUI on k3s (<a href="/posts/local-llm-k8s-no-gpu/">previous post</a>), I had a chat interface backed by a local model. What I didn&rsquo;t have: any visibility into how the model was behaving — whether requests were queuing, how fast tokens were being generated, how much of the context window was in use.</p>
<p>The instinct for this kind of problem is usually &ldquo;add a proxy layer.&rdquo; There are several tools in this space — LiteLLM being the most popular — that sit between the client and the inference server and record token counts, latency, and spend. I tried this first. LiteLLM OOMed at startup on a node already at 76% memory. Heavy Python import tree, not a lot of headroom.</p>
<p>The thing I&rsquo;d missed: llama.cpp ships a Prometheus metrics endpoint. No proxy required.</p>
<hr>
<h2 id="--metrics"><code>--metrics</code></h2>
<p>One additional argument to the inference server:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- -<span class="l">m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">/models/Phi-3.5-mini-instruct-Q4_K_M.gguf</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">host</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;0.0.0.0&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">port</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;8080&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">ctx-size</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;4096&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="kc">n</span>-<span class="l">predict</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1024&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">parallel</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">metrics       </span><span class="w"> </span><span class="c"># ← this</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- --<span class="l">log-disable</span><span class="w">
</span></span></span></code></pre></div><p>After restart, <code>GET /metrics</code> on port 8080 returns valid Prometheus exposition format:</p>
<pre tabindex="0"><code># HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 0

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
</code></pre><p>The full set of metrics:</p>
<table>
	<thead>
			<tr>
					<th>Metric</th>
					<th>Type</th>
					<th>What it measures</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>llamacpp:prompt_tokens_total</code></td>
					<td>counter</td>
					<td>Input tokens processed (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_total</code></td>
					<td>counter</td>
					<td>Output tokens generated (cumulative)</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current prompt throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:predicted_tokens_seconds</code></td>
					<td>gauge</td>
					<td>Current generation throughput (tok/s)</td>
			</tr>
			<tr>
					<td><code>llamacpp:tokens_predicted_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent generating</td>
			</tr>
			<tr>
					<td><code>llamacpp:prompt_seconds_total</code></td>
					<td>counter</td>
					<td>Total time spent on prompts</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_processing</code></td>
					<td>gauge</td>
					<td>Requests currently being processed</td>
			</tr>
			<tr>
					<td><code>llamacpp:requests_deferred</code></td>
					<td>gauge</td>
					<td>Requests queued, waiting for a slot</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_decode_total</code></td>
					<td>counter</td>
					<td>Total llama_decode() calls</td>
			</tr>
			<tr>
					<td><code>llamacpp:n_busy_slots_per_decode</code></td>
					<td>counter</td>
					<td>Slots active per decode call</td>
			</tr>
	</tbody>
</table>
<p>These cover the metrics that matter for a personal inference server: throughput, latency (derivable from total time / total tokens), and queue depth.</p>
<hr>
<h2 id="prometheus-scrape-config">Prometheus scrape config</h2>
<p>Adding a static scrape target in the existing Prometheus configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">extraScrapeConfigs</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">  - job_name: llama-server
</span></span></span><span class="line"><span class="cl"><span class="sd">    static_configs:
</span></span></span><span class="line"><span class="cl"><span class="sd">      - targets:
</span></span></span><span class="line"><span class="cl"><span class="sd">          - llama-server.web-ai-engine.svc:8080
</span></span></span><span class="line"><span class="cl"><span class="sd">    metrics_path: /metrics</span><span class="w">
</span></span></span></code></pre></div><p>The only non-obvious thing here is the network policy: Prometheus lives in <code>dashboard-homelab</code>, and llama-server lives in <code>web-ai-engine</code>. With Cilium network policies enforcing namespace isolation, the dashboard namespace needs to be allowed to make inbound connections to the AI engine namespace. In <code>applications.yml</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networkPolicies</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">allowIngressFromNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">dashboard-homelab]</span><span class="w">
</span></span></span></code></pre></div><p>Without this, Prometheus scrape attempts fail silently with a timeout.</p>
<hr>
<h2 id="grafana-dashboard-via-configmap">Grafana dashboard via ConfigMap</h2>
<p>Rather than importing a dashboard JSON manually through the Grafana UI, the Grafana sidecar handles it automatically. Any ConfigMap with the label <code>grafana_dashboard: &quot;1&quot;</code> is picked up, loaded, and available in Grafana — across all namespaces by default.</p>
<p>The dashboard ConfigMap lives in <code>web-ai-engine</code>, not <code>dashboard-homelab</code>. The sidecar finds it regardless:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ConfigMap</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">grafana-dashboard-llm</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">grafana_dashboard</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">llm-metrics.json</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">    {
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;title&#34;: &#34;LLM Metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      &#34;uid&#34;: &#34;llm-metrics&#34;,
</span></span></span><span class="line"><span class="cl"><span class="sd">      ...
</span></span></span><span class="line"><span class="cl"><span class="sd">    }</span><span class="w">
</span></span></span></code></pre></div><p>Argo CD reconciles the ConfigMap. The Grafana sidecar picks it up. The dashboard appears. No manual steps, no Grafana UI interaction, no state outside Git.</p>
<p>This means the dashboard is version-controlled, reproducible on cluster rebuild, and consistent across environments. The same YAML that describes the app&rsquo;s Kubernetes resources also describes what the monitoring looks like.</p>
<hr>
<h2 id="what-the-dashboard-shows">What the dashboard shows</h2>
<p>After sending a few messages through Open WebUI:</p>
<p><strong>Generation throughput</strong> — the <code>llamacpp:predicted_tokens_seconds</code> gauge drops to 0 between requests and spikes during generation. On this hardware (Intel N100, CPU-only inference, Phi-3.5-mini Q4_K_M), it reads 3–5 tok/s during active generation. This is the number to watch if you&rsquo;re comparing models or quantisation levels.</p>
<p><strong>Cumulative tokens</strong> — <code>llamacpp:prompt_tokens_total</code> and <code>llamacpp:tokens_predicted_total</code> both increase monotonically. The ratio between them is roughly the input/output ratio of your usage pattern. For conversational use it&rsquo;s typically 3:1 prompt to generation; for summarisation tasks it flips.</p>
<p><strong>Queue depth</strong> — <code>llamacpp:requests_deferred</code> is 0 almost always, which is expected with <code>--parallel 1</code>. If it&rsquo;s consistently above 0, you have more concurrent users than the server can handle with the current slot configuration.</p>
<p><strong>ms/token</strong> — derived from <code>rate(llamacpp:tokens_predicted_seconds_total[5m]) / rate(llamacpp:tokens_predicted_total[5m]) * 1000</code>. This is the per-token latency, which is the number that governs whether the response feels fast or slow. 200–300ms/token feels instant; above 400ms you start noticing.</p>
<hr>
<h2 id="whats-missing-compared-to-a-proxy-layer">What&rsquo;s missing compared to a proxy layer</h2>
<p>LiteLLM and similar proxies give you things this setup doesn&rsquo;t:</p>
<ul>
<li><strong>Per-model routing</strong> — if you&rsquo;re running multiple models, a proxy can route requests to the right one. With a single model, irrelevant.</li>
<li><strong>Virtual API keys</strong> — per-user or per-application key scoping. Not needed when the whole thing is behind SSO.</li>
<li><strong>Spend tracking</strong> — meaningful when you&rsquo;re paying per token. For a local model, the cost is electricity, which Prometheus already covers through the power monitoring dashboard.</li>
</ul>
<p>For a single-model homelab, the native metrics are sufficient. If I add more models later or need per-user attribution, a proxy layer becomes worth the RAM.</p>
<hr>
<h2 id="the-pattern">The pattern</h2>
<p>The broader point is that the observable unit here isn&rsquo;t the proxy — it&rsquo;s the inference server itself. Scraping llama.cpp directly means the metrics survive proxy changes, backend swaps, or routing redesigns. The inference server is the thing doing the work; it&rsquo;s the right place to measure.</p>
<p>Starter manifests with the metrics configuration included: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
]]></content:encoded></item><item><title>🤖 Local LLM Inference on Kubernetes, No GPU Required</title><link>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</guid><description>A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed as a single Git push.</description><content:encoded><![CDATA[<h2 id="the-gpu-assumption">The GPU assumption</h2>
<p>Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you&rsquo;re wasting your time — inference will be too slow to be useful.</p>
<p>That&rsquo;s not been my experience.</p>
<p>I&rsquo;ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.</p>
<p>The point isn&rsquo;t that CPU inference beats GPU inference. It&rsquo;s that &ldquo;good enough for personal use&rdquo; is a much lower bar than &ldquo;production LLM serving&rdquo;, and the hardware you already have probably clears it.</p>
<hr>
<h2 id="the-stack">The stack</h2>
<p>Two components:</p>
<p><strong>llama.cpp</strong> (<code>ghcr.io/ggml-org/llama.cpp:server</code>) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.</p>
<p><strong>Open WebUI</strong> (<code>ghcr.io/open-webui/open-webui</code>) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.</p>
<p>The architecture is simple on purpose:</p>
<pre tabindex="0"><code>Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models
</code></pre><p>Open WebUI doesn&rsquo;t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn&rsquo;t change. The interface is the standard.</p>
<hr>
<h2 id="model-choice">Model choice</h2>
<p>GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:</p>
<table>
	<thead>
			<tr>
					<th>Model</th>
					<th>Quant</th>
					<th>Size</th>
					<th>RAM at runtime</th>
					<th>Notes</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Llama-3.2-3B</td>
					<td>Q4_K_M</td>
					<td>~2 GB</td>
					<td>~3 GB</td>
					<td>Fastest, lowest quality</td>
			</tr>
			<tr>
					<td>Phi-3.5-mini</td>
					<td>Q4_K_M</td>
					<td>~2.4 GB</td>
					<td>~3–4 GB</td>
					<td>Good balance — what I use</td>
			</tr>
			<tr>
					<td>Mistral-7B-Instruct</td>
					<td>Q4_K_M</td>
					<td>~4.1 GB</td>
					<td>~5–6 GB</td>
					<td>Noticeably better, needs more RAM</td>
			</tr>
			<tr>
					<td>Llama-3.1-8B</td>
					<td>Q4_K_M</td>
					<td>~4.7 GB</td>
					<td>~6–8 GB</td>
					<td>High quality, stretches 16 GB with other workloads</td>
			</tr>
	</tbody>
</table>
<p>On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it&rsquo;s tighter.</p>
<p>Models live in <code>/srv/ai-models</code> on the node, mounted into the pod as a <code>hostPath</code> volume. Single-node homelab, so there&rsquo;s no scheduling concern. Download once with <code>wget</code>, done.</p>
<hr>
<h2 id="key-configuration-choices">Key configuration choices</h2>
<p><strong>Context size (<code>--ctx-size 4096</code>):</strong> How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you&rsquo;re summarising long documents, bump to 8192 and watch your RAM usage.</p>
<p><strong>Max output tokens (<code>--n-predict 1024</code>):</strong> Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.</p>
<p><strong>Parallel slots (<code>--parallel 1</code>):</strong> How many concurrent inference requests the server handles. On CPU there&rsquo;s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.</p>
<p><strong>Memory limits:</strong> Set the container limit to roughly 2× the model&rsquo;s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1Gi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">6Gi</span><span class="w">
</span></span></span></code></pre></div><p>No CPU limit. llama-server will use however many cores are available during inference — that&rsquo;s what makes it usable. A CPU limit would throttle inference to unusable speeds.</p>
<hr>
<h2 id="deployment-as-a-gitops-push">Deployment as a GitOps push</h2>
<p>The whole stack lives in one YAML values file, deployed through the <a href="https://github.com/janos-gyorgy/gitops-extra-objects-chart">extra-objects chart</a> that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.</p>
<p>Nothing was <code>kubectl apply</code>-ed. The deployment happened by pushing to Git.</p>
<p>What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no <code>kubectl</code>. The same process I use for any other service in the cluster.</p>
<p>The namespace, network policies, service account, and RBAC all generate from a single entry in <code>applications.yml</code> — same as every other app. The AI inference stack isn&rsquo;t special from an operations perspective.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># applications.yml excerpt</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><hr>
<h2 id="access-and-auth">Access and auth</h2>
<p>The service is exposed at <code>ai.hippotion.com</code> through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let&rsquo;s Encrypt cert. See <a href="/posts/homelab-dual-path-tls/">that post</a> for the full explanation.</p>
<p>Auth is handled by Traefik&rsquo;s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI&rsquo;s own auth is disabled (<code>WEBUI_AUTH: false</code>) — the OAuth layer upstream handles it. One login covers every service in the cluster.</p>
<p>The <code>WEBUI_SECRET_KEY</code> (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.</p>
<hr>
<h2 id="what-the-day-to-day-is-actually-like">What the day-to-day is actually like</h2>
<p>Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you&rsquo;re reading what came before while it generates, that&rsquo;s fine. For quick factual lookups, it&rsquo;s a little tedious.</p>
<p>The useful cases for a local model, for me:</p>
<ul>
<li><strong>Rephrasing or editing text</strong> — paste something, ask it to tighten it. No data leaves the house.</li>
<li><strong>Config explanation</strong> — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.</li>
<li><strong>Quick summaries</strong> — short documents, log snippets, error messages.</li>
<li><strong>Experimentation</strong> — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.</li>
</ul>
<p>For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I&rsquo;m iterating and don&rsquo;t want to pay per token.</p>
<hr>
<h2 id="the-starting-point-if-you-want-to-try-it">The starting point if you want to try it</h2>
<p>The manifests are on GitHub: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
<p>It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.</p>
<p>No GPU required. The ThinkCentre in the corner of my desk does the job.</p>
]]></content:encoded></item></channel></rss>