<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ai-Agents on hippotion</title><link>https://blog.hippotion.com/tags/ai-agents/</link><description>Recent content in Ai-Agents on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 07 Nov 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/ai-agents/index.xml" rel="self" type="application/rss+xml"/><item><title>🍵 I A/B-Tested Cloud vs Local LLMs in One n8n Agent. The Local One Faked It.</title><link>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</link><pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/n8n-agent-cloud-vs-local/</guid><description>I built an AI agent in self-hosted n8n over my kombucha-tracking app, then gave it two brains — NVIDIA&amp;rsquo;s 70B and a local Phi-3.5 — sharing the same tools. The cloud model called the tools and answered from real data. The local one couldn&amp;rsquo;t, so it made things up.</description><content:encoded><![CDATA[<h2 id="the-question">The question</h2>
<p>I run <a href="https://n8n.io">n8n</a> on my k3s homelab. Not docker-compose on a NUC — the full treatment: GitOps-reconciled, Vault-backed secrets, default-deny networking. The same boring platform everything else here runs on.</p>
<p>But &ldquo;I have n8n running&rdquo; proves nothing. I wanted to know if I actually understood it as an <em>agent platform</em>, and to answer a question I kept dodging: <strong>for agent work, do I need a cloud model, or is my local one good enough?</strong></p>
<p>So I built a real agent and gave it two brains.</p>
<h2 id="what-i-built">What I built</h2>
<p>A chat assistant over brew-buddy, my homemade kombucha-tracking app (React + a small API + Postgres). You ask it things in plain language; it calls the app&rsquo;s API and answers. The twist: the same question runs through <strong>two agents in parallel</strong> — one backed by NVIDIA&rsquo;s hosted <strong>Llama-3.3-70B</strong>, one by a local <strong>Phi-3.5-mini</strong> on CPU — and the workflow prints both answers side by side.</p>
<pre tabindex="0"><code>Chat ──▶ Agent (cloud: NVIDIA 70B) ──┐   tools (shared):
     └─▶ Agent (local: Phi-3.5)   ──┤     • get_all_batches
                                    │     • get_batch_detail
                                    │     • brewing_statistics
            (Merge) ──▶ both replies, labeled     • add_batch_log   ⟵ write
                                                  • create_batch    ⟵ write
</code></pre><p>Both agents share the same read tools. The two <em>write</em> tools are wired to the cloud agent only — more on that below.</p>
<p><img alt="The kombucha agent in n8n: a chat trigger fans out to two AI Agent nodes (cloud and local), both wired to the same brew-buddy tools, then merged so the two answers print side by side." loading="lazy" src="/posts/n8n-agent-cloud-vs-local/n8n.png"></p>
<p>The nice part: I didn&rsquo;t write a line of glue. n8n&rsquo;s stock <strong>OpenAI Chat Model</strong> node talks to anything OpenAI-compatible if you override the credential&rsquo;s Base URL — so one node points at <code>https://integrate.api.nvidia.com/v1</code>, the other at <code>http://llama-server.&lt;ns&gt;.svc:8080/v1</code> for the local server. Same node, two endpoints.</p>
<h2 id="the-infra-that-keeps-it-honest">The infra that keeps it honest</h2>
<p>I won&rsquo;t re-explain the platform here — it&rsquo;s in earlier posts: <a href="/posts/homelab-gitops/">GitOps</a>, <a href="/posts/k8s-gitops-secrets/">Vault-backed secrets</a>, <a href="/posts/k8s-network-isolation/">default-deny networking</a>, <a href="/posts/homelab-dual-path-tls/">dual-path TLS ingress</a>. But building the agent made one of them <em>tangible</em>.</p>
<p>n8n is, by design, a thing that makes arbitrary HTTP calls on a schedule. That&rsquo;s exactly what you want behind a default-deny network policy. n8n couldn&rsquo;t reach the brew-buddy API at all until I declared it — one line:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># n8n&#39;s namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">allowEgressToNamespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">web-ai-engine, web-brew-buddy]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c">#                                          ^ added this for the agent</span><span class="w">
</span></span></span></code></pre></div><p>(plus a matching ingress-allow on brew-buddy&rsquo;s side). That&rsquo;s the posture working as intended: the blast radius of a workflow tool is whatever I&rsquo;ve explicitly granted, and not one namespace more. Adding a capability is a reviewable one-liner in Git; Argo reconciles it. No <code>kubectl</code>, no guessing what n8n can reach.</p>
<h2 id="the-ab-same-agent-same-tools-two-brains">The A/B: same agent, same tools, two brains</h2>
<p><strong>Plain &ldquo;hi&rdquo;.</strong> Cloud answers in ~0.5s. Local takes noticeably longer — because even for &ldquo;hi&rdquo;, the agent feeds the model the full system prompt <em>plus the JSON schemas for every tool</em>, and Phi-3.5 has to chew through all of it on CPU before it can say a word. So far, the boring expected result: local is slower.</p>
<p>Then I asked a real question, and the result flipped in a way I didn&rsquo;t expect.</p>
<p><strong>&ldquo;What batches do I have?&rdquo;</strong></p>
<p>Cloud (70B) called <code>get_all_batches</code>, got the real rows, and answered:</p>
<blockquote>
<p>You have two batches: 2026-04-09-A (cold-crash, 3L) and 2026-04-09-W (cold-crash, 3L).</p>
</blockquote>
<p>Local (Phi-3.5) <strong>never called the tool.</strong> It didn&rsquo;t seem to realise it <em>had</em> tools. Instead it confidently explained how <em>I</em> could go find the data myself:</p>
<blockquote>
<p>To list all batches: 1. Access the brew-buddy app. 2. Look for a button labeled &ldquo;List Batches&rdquo;… <code>def get_all_batches(): …</code> … Remember, I&rsquo;m unable to directly interact with apps or databases.</p>
</blockquote>
<p>Fake instructions. Fake code. A polite apology. Everything except the actual answer it was sitting on top of.</p>
<p><strong>Writing data.</strong> I asked both to <em>log</em> an observation. Cloud called <code>add_batch_log</code> and wrote a real row to Postgres (&ldquo;I have recorded the observation…&rdquo;). Local bluffed again — &ldquo;here&rsquo;s how <em>you</em> can log it yourself.&rdquo;</p>
<h2 id="why-it-matters-capability-not-latency">Why it matters: capability, not latency</h2>
<p>The interesting finding isn&rsquo;t &ldquo;the big model is better.&rdquo; It&rsquo;s <em>how</em> the small one fails.</p>
<p>With a ~3.8B model on CPU, the bottleneck for agent work isn&rsquo;t speed — it&rsquo;s <strong>capability</strong>. Phi-3.5 couldn&rsquo;t reliably emit tool calls, so n8n&rsquo;s tools never fired, and the model degraded into a chatbot that <strong>hallucinates a plausible answer instead of fetching the real one.</strong> That failure mode is worse than an error: an error you catch, a confident wrong answer you ship.</p>
<p>A couple of measurements that sharpened it:</p>
<ul>
<li>NVIDIA 70B, <strong>plain chat</strong>: ~0.5s.</li>
<li>NVIDIA 70B, <strong>function-calling</strong> (with tool schemas): ~8.6s per round-trip — and an agent makes several round-trips per answer. That&rsquo;s real latency you have to budget a timeout for. (It&rsquo;s also why the cloud side initially <em>timed out</em> in n8n until I raised the model node&rsquo;s timeout — the model was fine, n8n was cutting it off.)</li>
</ul>
<p>So the snappy-vs-slow comparison <strong>flips depending on whether the question triggers tools</strong>. Plain chat: cloud wins on speed. Tool use: the local model is &ldquo;fast&rdquo; only because it skips the tools and makes something up. Speed was never the real axis.</p>
<p>The honest caveat: this is <em>this</em> small general model in a multi-tool agent loop. Purpose-built small models with tool-calling fine-tunes do better at narrow tasks — I run a 1.7B one elsewhere that emits a single structured tool call just fine. But for &ldquo;pick the right tool from several and chain them,&rdquo; 70B was in a different league.</p>
<h2 id="the-trust-boundary">The trust boundary</h2>
<p>I gave the write tools (<code>add_batch_log</code>, <code>create_batch</code>) to the cloud agent <strong>only</strong>. The local agent is read-only — not by instruction, by wiring. Even if Phi-3.5 <em>did</em> decide to call a write tool, the connection isn&rsquo;t there. The reliable model is the only one allowed to mutate real data, and that&rsquo;s enforced structurally, not by trusting a prompt.</p>
<h2 id="whats-toy-and-whats-real">What&rsquo;s toy and what&rsquo;s real</h2>
<p>Worth being straight: this is a <strong>single-node homelab</strong>. The agent and both model paths share one box. Running n8n on Kubernetes and swapping models isn&rsquo;t novel — <a href="https://docs.n8n.io/hosting/scaling/queue-mode/">n8n&rsquo;s own docs</a> cover queue mode, where a main instance fans work out to a pool of worker pods you scale horizontally, with external Postgres for state. That&rsquo;s the real production shape. Mine is one replica with an emptyDir&rsquo;s worth of ambition.</p>
<p>What I think <em>is</em> worth sharing is the finding (the capability cliff, and that its failure mode is confident fabrication) and the boring thing underneath it: because the platform is default-deny and GitOps-reconciled, running this experiment cost me one reviewable egress line and zero risk to anything else.</p>
<h2 id="the-boring-part-is-the-point">The boring part is the point</h2>
<p>The AI was the fun bit. But the reason I could bolt an agent onto a live cluster, point it at a real app, give it write access to one model and not the other, and tear it all down again — without worrying what it might touch — is that the infrastructure was already boring. Default-deny. Secrets out of Git. <code>git push</code>, Argo reconciles.</p>
<p>The model picks the tools. The platform decides what the tools can reach. Keep those two honest about each other and self-hosting an agent stops being scary and starts being just another app.</p>
]]></content:encoded></item></channel></rss>