<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Selfhosted on hippotion</title><link>https://blog.hippotion.com/tags/selfhosted/</link><description>Recent content in Selfhosted on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 29 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/selfhosted/index.xml" rel="self" type="application/rss+xml"/><item><title>Every Robot in My House Can Text Me Now</title><link>https://blog.hippotion.com/posts/every-robot-texts-me/</link><pubDate>Fri, 29 May 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/every-robot-texts-me/</guid><description>My house is full of automation that never told me anything — until I gave it one push bus. The first thing I taught it to do was warn me before Claude Code cuts out mid-task.</description><content:encoded><![CDATA[<h2 id="the-silence">The silence</h2>
<p>My house runs on quiet little robots. A tracker watches my kombucha ferment. A
job narrates kids&rsquo; books in Hungarian. A media stack pulls and files things. Home
Assistant minds the sensors. A dozen services, all doing their jobs, all
completely mute. When a batch finished or an import failed, I found out the same
way every time: by going to look.</p>
<p>Then the silence got expensive. Claude Code stopped dead in the middle of a task
because I&rsquo;d burned through my plan&rsquo;s usage window — no warning, no countdown,
just a wall. The information <em>existed</em>; a dashboard in my own cluster was already
polling it. It just had no way to reach my pocket.</p>
<p>So I built one thing: a push bus. One place anything in the cluster can POST to,
that actually buzzes my phone. And the first job I gave it was to warn me before
my AI assistant goes dark.</p>
<hr>
<h2 id="the-boring-part-said-honestly">The boring part (said honestly)</h2>
<p>The bus is <a href="https://ntfy.sh">ntfy</a> — a self-hosted pub/sub notifier. Picking it
took about five minutes, because self-hosting ntfy for a homelab is a thoroughly
solved problem. There are at least three off-the-shelf bridges from Prometheus
Alertmanager to ntfy. I&rsquo;m not going to pretend the bus is the clever bit.</p>
<p>What I <em>did</em> do deliberately:</p>
<ul>
<li>📦 Deployed it <strong>GitOps-native</strong> — one entry in my app-of-apps, reconciled by
Argo CD, no <code>docker run</code> anywhere.</li>
<li>🔒 Locked it to <strong>deny-all auth</strong> with bearer tokens. Security alerts ride this
bus; a world-readable topic on a public URL was a non-starter. (Which also means
it sits <em>outside</em> my usual OAuth gate — the phone app can&rsquo;t do an interactive
login flow, so ntfy does its own token auth.)</li>
<li>🏷️ Topics by severity: <code>hl-crit</code>, <code>hl-warn</code>, <code>hl-info</code>, <code>hl-event</code>. Subscribe
and mute by how much I care.</li>
</ul>
<p>Then the interesting parts showed up at the edges, where they always do.</p>
<hr>
<h2 id="edge-one-my-own-firewall-403d-me">Edge one: my own firewall 403&rsquo;d me</h2>
<p>First test, the usage producer POSTing to <code>https://ntfy.hippotion.com</code>:</p>
<pre tabindex="0"><code>HTTP 403 Forbidden
error code: 1010
</code></pre><p>That <code>1010</code> looks like ntfy rejecting my token. It isn&rsquo;t. <strong>It&rsquo;s Cloudflare.</strong>
Error 1010 means &ldquo;your browser signature is banned&rdquo; — Cloudflare&rsquo;s bot protection
took one look at a Python script&rsquo;s <code>urllib</code> User-Agent and slammed the door.</p>
<p>My own producer couldn&rsquo;t reach my own bus, because the request left the cluster,
went all the way out to my own edge, and got flagged as a bot on the way back in.</p>
<p>The fix is the architecture I should&rsquo;ve had from the start: in-cluster producers
POST to the <strong>internal</strong> service address and never touch the public internet at
all.</p>
<pre tabindex="0"><code># wrong: out to Cloudflare and back, gets bot-blocked
https://ntfy.hippotion.com/hl-warn

# right: stays inside the cluster
http://ntfy.web-ntfy.svc.cluster.local/hl-warn
</code></pre><p>The phone still uses the public URL happily — the real ntfy app carries a
signature Cloudflare trusts. Only scripts trip 1010. <strong>Lesson: your own edge is
not your friend when you&rsquo;re a script. Keep cluster traffic in the cluster.</strong></p>
<hr>
<h2 id="edge-two-the-obvious-data-source-was-lying">Edge two: the obvious data source was lying</h2>
<p>To warn me about Claude usage, the naïve move is to parse Claude Code&rsquo;s local
logs — they sit right there in <code>~/.claude/projects/.../*.jsonl</code>, token counts and
all.</p>
<p>Don&rsquo;t. Those counts are <strong>unreliable for accounting</strong> — known to undercount,
wildly, in some cases by ~100x. Every tool that parses that JSONL inherits the
bug.</p>
<p>The number that&rsquo;s actually true lives in the claude.ai usage API — the same
<code>five_hour</code> and <code>seven_day</code> windows your plan enforces against. And I already had
a service polling exactly that. So the producer is just a tiny sidecar on that
existing pod, reading its <code>/api/usage</code> over <strong>localhost</strong> (same pod — no network
policy to negotiate, no second credential, nothing else hammering claude.ai):</p>
<ul>
<li>📈 ≥80% of a window → <code>hl-warn</code> (high).</li>
<li>🚨 ≥95% → <code>hl-crit</code> (urgent).</li>
<li>🔁 One ping per window per reset cycle, escalating warn→crit, keyed on the
reset timestamp so it never spams.</li>
</ul>
<p>The first time it mattered, my phone buzzed at 80% with hours of runway left
instead of a brick wall mid-task.</p>
<hr>
<h2 id="what-id-tell-past-me">What I&rsquo;d tell past me</h2>
<p>Three things, none of them about ntfy:</p>
<ol>
<li><strong>Reuse the signal you already have.</strong> I didn&rsquo;t build a usage poller — I bolted
a sidecar onto the one already running. The smallest producer is one that reads
localhost.</li>
<li><strong>Your own edge can betray you.</strong> A firewall that protects you from bots will
happily block your own automation. In-cluster talks in-cluster.</li>
<li><strong>Check whether your data source is telling the truth</strong> before you build an
alert on it. An alert you don&rsquo;t trust is worse than no alert — you&rsquo;ll learn to
ignore it, and then it&rsquo;ll be right once.</li>
</ol>
<p>Next, the high-leverage move: point Prometheus Alertmanager at the same bus, and
every infra alert I have — plus every one I&rsquo;ll ever add — lands on the phone
through one bridge. The kombucha ping can wait. The disk-full one can&rsquo;t.</p>
<p>The house is still full of quiet robots. The difference is now they know my
number.</p>
]]></content:encoded></item><item><title>🎙️ Cloning My Own Voice for My Kid's Audiobooks</title><link>https://blog.hippotion.com/posts/clone-your-voice-hungarian-audiobooks/</link><pubDate>Fri, 13 Mar 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/clone-your-voice-hungarian-audiobooks/</guid><description>Zero-shot voice cloning with XTTS-v2 on a CPU-only k3s node: 26 seconds of phone audio in, a cloned-voice audiobook out — and an honest verdict from the bedtime jury. Every manual step, including the ones that went wrong.</description><content:encoded><![CDATA[<h2 id="the-problem-nobody-sells-a-fix-for">The problem nobody sells a fix for</h2>
<p>My kid loves audiobooks. The commercial platforms barely carry Hungarian
children&rsquo;s books, and none of them carry the one narrator my kid actually
prefers: me. I can&rsquo;t read aloud every evening — but my homelab doesn&rsquo;t have
that excuse.</p>
<p>The platform half (ebook → M4B → Audiobookshelf on k3s) is a story for
another post. This one is about the voice: how to go from a phone recording
to an audiobook narrated in your own voice, step by step, on hardware with
no GPU.</p>
<p>The short version: <strong>XTTS-v2 does zero-shot voice cloning from a ~20-second
sample.</strong> No training, no fine-tuning, no dataset. One clean recording and a
flag.</p>
<hr>
<h2 id="why-xtts-v2-in-2026">Why XTTS-v2, in 2026?</h2>
<p>It&rsquo;s not the best open TTS model anymore. Chatterbox beats ElevenLabs in
blind tests; F5-TTS sounds cleaner. But model selection for a small language
is constraint-first, not leaderboard-first: Chatterbox has <strong>no Hungarian</strong>,
NVIDIA&rsquo;s TTS NIMs have <strong>no Hungarian</strong>, Kokoro — no Hungarian. XTTS-v2
speaks Hungarian <em>and</em> clones voices <em>and</em> runs on CPU. That intersection
has exactly one resident.</p>
<p>I run it via <a href="https://github.com/DrewThomasson/ebook2audiobook">ebook2audiobook</a>,
which wraps XTTS with Calibre ingestion and M4B chaptering.</p>
<hr>
<h2 id="step-1--record-25-seconds-of-yourself">Step 1 — Record ~25 seconds of yourself</h2>
<p>Phone voice-memo app, quiet room, ~20 cm from your mouth. Mine came out as
28 seconds of stereo 48 kHz AAC. Two rules that matter more than gear:</p>
<ul>
<li><strong>Read the way you want the books narrated.</strong> The clone copies prosody —
energy, pacing, warmth — not just timbre. A flat recital clones into a
flat narrator. I read a children&rsquo;s tale the way I&rsquo;d read it at bedtime.</li>
<li><strong>Don&rsquo;t peak the mic.</strong> My sample hit −0.1 dB max volume — right at the
clipping ceiling. It worked, but quieter is safer. Check yours:</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -i janos.m4a -af volumedetect -f null - 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> grep volume
</span></span><span class="line"><span class="cl"><span class="c1"># mean_volume: -21.4 dB   ← fine</span>
</span></span><span class="line"><span class="cl"><span class="c1"># max_volume:  -0.1 dB    ← living dangerously</span>
</span></span></code></pre></div><hr>
<h2 id="step-2--normalize-to-what-xtts-wants">Step 2 — Normalize to what XTTS wants</h2>
<p>XTTS expects a mono WAV; 24 kHz matches its internal rate. Trim the silence
off both ends while you&rsquo;re at it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -i janos.m4a <span class="se">\
</span></span></span><span class="line"><span class="cl">  -af <span class="s2">&#34;silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
</span></span></span><span class="line"><span class="cl"><span class="s2">areverse,silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
</span></span></span><span class="line"><span class="cl"><span class="s2">areverse&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -ar <span class="m">24000</span> -ac <span class="m">1</span> janos.wav
</span></span></code></pre></div><p>(The double-<code>areverse</code> is the classic trick: <code>silenceremove</code> only trims the
front, so you flip the audio, trim the front again, flip it back.)</p>
<p>Drop the result where your TTS stack looks for voices. In ebook2audiobook
that&rsquo;s the <code>voices/</code> tree, organised by language:</p>
<pre tabindex="0"><code>voices/hun/adult/male/janos.wav
</code></pre><hr>
<h2 id="step-3--synthesize">Step 3 — Synthesize</h2>
<p>One flag does the cloning. Headless run on the k3s pod:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl <span class="nb">exec</span> -n web-audiobooks deploy/ebook2audiobook -- sh -c <span class="se">\
</span></span></span><span class="line"><span class="cl">  <span class="s1">&#39;cd /app &amp;&amp; python app.py --headless \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --ebook &#34;/app/ebooks/tale.txt&#34; \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --language hun \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --tts_engine xtts \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --device cpu \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --voice /app/voices/hun/adult/male/janos.wav \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --output_format m4b \
</span></span></span><span class="line"><span class="cl"><span class="s1">     --output_dir /app/audiobooks&#39;</span>
</span></span></code></pre></div><p>On my 12-core CPU node this runs at roughly 3× real-time — a 2-minute tale
takes ~8 minutes, a full children&rsquo;s book is an overnight job. The first run
computes speaker latents from your WAV; after that it&rsquo;s ordinary synthesis
with your voice as the reference.</p>
<hr>
<h2 id="step-4--ab-before-you-batch">Step 4 — A/B before you batch</h2>
<p>Render one <em>short</em> book twice — stock narrator and cloned voice — and put
both in front of the household jury. Cloning quality is personal in the most
literal sense: MOS scores won&rsquo;t tell you whether it sounds like <em>you</em>. My
benchmark has strong opinions and goes to bed at eight.</p>
<p>Only after the clone passes do you re-render the library with <code>--voice</code>.</p>
<p><img alt="Audiobookshelf library with the same tale twice: stock narrator and the &ldquo;apa hangján&rdquo; clone, side by side for the jury" loading="lazy" src="/posts/clone-your-voice-hungarian-audiobooks/abs-ab.png"></p>
<hr>
<h2 id="the-manual-steps-that-earn-the-word-manual">The manual steps that earn the word &ldquo;manual&rdquo;</h2>
<p>Things the tutorials skip, learned the slow way:</p>
<ul>
<li><strong>Long conversions die with the browser tab.</strong> Gradio-style web UIs tie
the job to the open page; close the laptop and you get &ldquo;Conversion
cancelled&rdquo; half a book in. Anything longer than ~15 minutes of audio runs
headless under <code>nohup</code>.</li>
<li><strong>CPU synthesis leaks memory over hours.</strong> My pod has a hard 6 Gi limit on
a 16 Gi node, and a 6-hour run will hit it. Keep the cap (it protects the
other 30 namespaces), and rely on the tool&rsquo;s <code>--session &lt;id&gt;</code> resume — it
picks up at the exact sentence. One catch: headless resume still asks an
interactive <code>Resume? [y]es</code> — pipe <code>echo y |</code> into it.</li>
<li><strong>The per-chapter FLACs survive a crash.</strong> If the final M4B muxing step
OOMs, don&rsquo;t re-synthesize: the chapters are sitting in the session&rsquo;s tmp
directory, and <code>ffmpeg</code> will assemble them into a chaptered M4B with a
hand-written FFMETADATA file in about two minutes, at near-zero memory.</li>
</ul>
<p>None of this is hard. It&rsquo;s just undocumented — which is the gap between
&ldquo;there&rsquo;s a model for that&rdquo; and your kid pressing play.</p>
<hr>
<h2 id="postscript-the-jury-came-back">Postscript: the jury came back</h2>
<p>The clone failed. Recognizably my timbre, nowhere near natural — I wouldn&rsquo;t
play it to my kid, which is the only metric that exists for this project.</p>
<p>Worth being precise about <em>what</em> failed: the stock XTTS-v2 narrator passed
the ear test and the library keeps growing with it. Zero-shot <strong>cloning</strong> is
the part that fell short — a 2023 model conditioning on 26 seconds of a
voice it has never seen, in a language that was never its strong suit. The
pipeline above is still the right pipeline; the model isn&rsquo;t there yet on
CPU-class options.</p>
<p>The next experiment is already picked: <a href="https://huggingface.co/Maxdorger29/f5-tts-hungarian">F5-TTS Hungarian</a>,
a 2026 fine-tune on 280 hours of actual Hungarian speech, built precisely
for short-sample cloning. It needs CUDA, which my node doesn&rsquo;t have — but a
rented spot GPU tests it for the price of an espresso. If it passes the
bedtime jury, that&rsquo;ll be its own post.</p>
<p>Negative results are results. The jury reconvenes when the GPU shows up.</p>
]]></content:encoded></item><item><title>🤖 Local LLM Inference on Kubernetes, No GPU Required</title><link>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/local-llm-k8s-no-gpu/</guid><description>A CPU-only self-hosted LLM stack running on k3s: llama.cpp as the inference server, Open WebUI as the chat interface, deployed as a single Git push.</description><content:encoded><![CDATA[<h2 id="the-gpu-assumption">The GPU assumption</h2>
<p>Most write-ups about self-hosting LLMs start with a GPU. A 3090, an A100, at minimum something with CUDA. The implication is that without one you&rsquo;re wasting your time — inference will be too slow to be useful.</p>
<p>That&rsquo;s not been my experience.</p>
<p>I&rsquo;ve been running a local LLM stack on a ThinkCentre mini PC (Intel N100, 16 GB RAM, no discrete GPU) for a few months. The model is Phi-3.5-mini-instruct, 3.8 billion parameters, 4-bit quantised. Response time is 3–6 tokens per second on CPU — slow enough that you notice it, fast enough that you use it. For the things I actually reach for a local model to do — rephrase something, summarise a document, explain a config option without sending it to an external API — the latency is fine.</p>
<p>The point isn&rsquo;t that CPU inference beats GPU inference. It&rsquo;s that &ldquo;good enough for personal use&rdquo; is a much lower bar than &ldquo;production LLM serving&rdquo;, and the hardware you already have probably clears it.</p>
<hr>
<h2 id="the-stack">The stack</h2>
<p>Two components:</p>
<p><strong>llama.cpp</strong> (<code>ghcr.io/ggml-org/llama.cpp:server</code>) — inference server that loads a GGUF model file and exposes an OpenAI-compatible REST API. No Python, no framework overhead, minimal memory footprint beyond the model itself.</p>
<p><strong>Open WebUI</strong> (<code>ghcr.io/open-webui/open-webui</code>) — a polished chat interface that speaks OpenAI API format. It points at the llama-server endpoint as its backend, handles conversation history, and supports RAG file uploads out of the box.</p>
<p>The architecture is simple on purpose:</p>
<pre tabindex="0"><code>Browser → Open WebUI (:80)
              │
              │  OpenAI-compatible API
              ▼
         llama-server (:8080)
              │
              │  reads GGUF model file
              ▼
         hostPath /srv/ai-models
</code></pre><p>Open WebUI doesn&rsquo;t know or care that the backend is llama.cpp running on CPU. It sees an OpenAI-compatible API. This matters: if I swap llama-server for Ollama, vLLM, or a cloud endpoint, the frontend doesn&rsquo;t change. The interface is the standard.</p>
<hr>
<h2 id="model-choice">Model choice</h2>
<p>GGUF models on Hugging Face are available at multiple quantisation levels. The trade-off is quality vs. RAM:</p>
<table>
	<thead>
			<tr>
					<th>Model</th>
					<th>Quant</th>
					<th>Size</th>
					<th>RAM at runtime</th>
					<th>Notes</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Llama-3.2-3B</td>
					<td>Q4_K_M</td>
					<td>~2 GB</td>
					<td>~3 GB</td>
					<td>Fastest, lowest quality</td>
			</tr>
			<tr>
					<td>Phi-3.5-mini</td>
					<td>Q4_K_M</td>
					<td>~2.4 GB</td>
					<td>~3–4 GB</td>
					<td>Good balance — what I use</td>
			</tr>
			<tr>
					<td>Mistral-7B-Instruct</td>
					<td>Q4_K_M</td>
					<td>~4.1 GB</td>
					<td>~5–6 GB</td>
					<td>Noticeably better, needs more RAM</td>
			</tr>
			<tr>
					<td>Llama-3.1-8B</td>
					<td>Q4_K_M</td>
					<td>~4.7 GB</td>
					<td>~6–8 GB</td>
					<td>High quality, stretches 16 GB with other workloads</td>
			</tr>
	</tbody>
</table>
<p>On 16 GB RAM with a full k3s stack running alongside (Argo CD, Traefik, Vault, Prometheus, etc.), Phi-3.5-mini leaves enough headroom that the cluster stays stable. Mistral-7B works too, but it&rsquo;s tighter.</p>
<p>Models live in <code>/srv/ai-models</code> on the node, mounted into the pod as a <code>hostPath</code> volume. Single-node homelab, so there&rsquo;s no scheduling concern. Download once with <code>wget</code>, done.</p>
<hr>
<h2 id="key-configuration-choices">Key configuration choices</h2>
<p><strong>Context size (<code>--ctx-size 4096</code>):</strong> How many tokens the model holds in its attention window. Larger context = more RAM + slower inference. 4096 is fine for conversational use. If you&rsquo;re summarising long documents, bump to 8192 and watch your RAM usage.</p>
<p><strong>Max output tokens (<code>--n-predict 1024</code>):</strong> Hard cap on response length. llama.cpp will stop there even mid-sentence. 1024 is usually enough; increase if you find it cutting off long explanations.</p>
<p><strong>Parallel slots (<code>--parallel 1</code>):</strong> How many concurrent inference requests the server handles. On CPU there&rsquo;s no benefit to more than 1 — each slot competes for the same cores. Leave it at 1.</p>
<p><strong>Memory limits:</strong> Set the container limit to roughly 2× the model&rsquo;s file size. A 2.4 GB GGUF typically uses 3–4 GB at runtime with context loaded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1Gi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">6Gi</span><span class="w">
</span></span></span></code></pre></div><p>No CPU limit. llama-server will use however many cores are available during inference — that&rsquo;s what makes it usable. A CPU limit would throttle inference to unusable speeds.</p>
<hr>
<h2 id="deployment-as-a-gitops-push">Deployment as a GitOps push</h2>
<p>The whole stack lives in one YAML values file, deployed through the <a href="https://github.com/janos-gyorgy/gitops-extra-objects-chart">extra-objects chart</a> that I use for raw manifests across the cluster. Argo CD watches the repo and reconciles automatically.</p>
<p>Nothing was <code>kubectl apply</code>-ed. The deployment happened by pushing to Git.</p>
<p>What that means in practice: when I bumped the Open WebUI image version, I changed one line, pushed, and Argo CD rolled the pod. No manual steps, no SSH, no <code>kubectl</code>. The same process I use for any other service in the cluster.</p>
<p>The namespace, network policies, service account, and RBAC all generate from a single entry in <code>applications.yml</code> — same as every other app. The AI inference stack isn&rsquo;t special from an operations perspective.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># applications.yml excerpt</span><span class="w">
</span></span></span><span class="line"><span class="cl">- <span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">applications</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">applicationCode</span><span class="p">:</span><span class="w"> </span><span class="l">web-ai-engine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">helm-charts/extra-objects</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">autoSync</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span></code></pre></div><hr>
<h2 id="access-and-auth">Access and auth</h2>
<p>The service is exposed at <code>ai.hippotion.com</code> through the same dual-path ingress setup I use everywhere: Cloudflare Tunnel for external access, direct-to-server via Pi-hole DNS for local access, Traefik handling both with a wildcard Let&rsquo;s Encrypt cert. See <a href="/posts/homelab-dual-path-tls/">that post</a> for the full explanation.</p>
<p>Auth is handled by Traefik&rsquo;s ForwardAuth middleware pointing at an oauth2-proxy backed by GitLab. Open WebUI&rsquo;s own auth is disabled (<code>WEBUI_AUTH: false</code>) — the OAuth layer upstream handles it. One login covers every service in the cluster.</p>
<p>The <code>WEBUI_SECRET_KEY</code> (used to sign Open WebUI sessions) comes from Vault via External Secrets Operator. Nothing sensitive in Git.</p>
<hr>
<h2 id="what-the-day-to-day-is-actually-like">What the day-to-day is actually like</h2>
<p>Slow is the obvious caveat. Phi-3.5-mini at 3–6 tok/s means a paragraph-length response takes 20–30 seconds. For coding help where you&rsquo;re reading what came before while it generates, that&rsquo;s fine. For quick factual lookups, it&rsquo;s a little tedious.</p>
<p>The useful cases for a local model, for me:</p>
<ul>
<li><strong>Rephrasing or editing text</strong> — paste something, ask it to tighten it. No data leaves the house.</li>
<li><strong>Config explanation</strong> — paste a Kubernetes manifest or a Traefik config block, ask what it does. Again, stays local.</li>
<li><strong>Quick summaries</strong> — short documents, log snippets, error messages.</li>
<li><strong>Experimentation</strong> — trying prompting techniques, testing system prompts, benchmarking quantisation levels without API costs.</li>
</ul>
<p>For longer reasoning tasks I use a cloud model. The local stack is for the cases where I want the answer to stay on-premises, or where I&rsquo;m iterating and don&rsquo;t want to pay per token.</p>
<hr>
<h2 id="the-starting-point-if-you-want-to-try-it">The starting point if you want to try it</h2>
<p>The manifests are on GitHub: <a href="https://github.com/janos-gyorgy/homelab-ai-inference-starter">homelab-ai-inference-starter</a></p>
<p>It includes the llama-server and Open WebUI deployments, resource configuration, and ingress options for Traefik and nginx. The README walks through downloading a model, applying the manifests, and the configuration knobs worth knowing.</p>
<p>No GPU required. The ThinkCentre in the corner of my desk does the job.</p>
]]></content:encoded></item></channel></rss>