<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Search on hippotion</title><link>https://blog.hippotion.com/tags/search/</link><description>Recent content in Search on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 05 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/search/index.xml" rel="self" type="application/rss+xml"/><item><title>I Added a Knowledge Graph to My Search. It Made It Worse.</title><link>https://blog.hippotion.com/posts/graph-hurt-my-search/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/graph-hurt-my-search/</guid><description>My second brain searches over a vault of markdown using BM25 + vectors + graph expansion. I&amp;rsquo;d been telling people the graph improved recall. Then I finally benchmarked it, and plain keyword search beat my fancy hybrid — the graph was actively dragging the right answers out of the results. Here&amp;rsquo;s the scorecard and what it taught me about where graphs actually belong.</description><content:encoded><![CDATA[<p>I have a note in my second brain that I wrote months ago. It says, with the
confidence of someone who hadn&rsquo;t measured anything:</p>
<blockquote>
<p>Combining lexical search (BM25) with vector similarity and graph expansion
produces more robust recall than embeddings alone.</p>
</blockquote>
<p>That sentence shipped into production. My <a href="/posts/a-second-brain-you-can-git-clone/">vault of markdown notes</a>
gets indexed into a search database, and the search function fuses three
signals: BM25 (classic keyword ranking), vector similarity (embeddings), and
<strong>graph expansion</strong> — when a note matches, pull in its linked neighbours too,
on the theory that the thing you want is often <em>next to</em> the thing you typed.</p>
<p>It sounds right. Graphs are having a moment in RAG. &ldquo;Add a knowledge graph to
your retrieval&rdquo; is the kind of thing you can put on a slide and nobody pushes
back. I believed it enough to make graph expansion a first-class signal with a
weight of <code>0.5</code> — equal footing with keyword matching.</p>
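<p>To make the moving parts concrete, here&rsquo;s a minimal sketch of that kind of weighted fusion. It is not my production code: the function name <code>fuse</code>, the input shapes, the note names, and the additive formula are all assumptions for illustration. Only the <code>0.5</code> graph weight comes from the real system.</p>

```python
from collections import defaultdict

def fuse(bm25, vector, links, w_bm25=0.5, w_vec=0.5, w_graph=0.5):
    """Blend per-note scores into one ranking (illustrative formula only)."""
    scores = defaultdict(float)
    for note, s in bm25.items():
        scores[note] += w_bm25 * s
    for note, s in vector.items():
        scores[note] += w_vec * s
    # Graph expansion: every keyword-matched note passes a share of its
    # score on to the notes it links to.
    for note, s in bm25.items():
        for neighbour in links.get(note, ()):
            scores[neighbour] += w_graph * s
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inputs: two keyword matches that both link to one shared note.
bm25 = {"owner-note": 1.0, "setup-guide": 0.8}
vector = {}  # no embedding scores in this toy run
links = {"owner-note": ["inbox"], "setup-guide": ["inbox"]}

with_graph = fuse(bm25, vector, links)                   # ['inbox', 'owner-note', 'setup-guide']
without_graph = fuse(bm25, vector, links, w_graph=0.0)   # ['owner-note', 'setup-guide', 'inbox']
```

<p>In this toy run the shared neighbour never matched the query, yet it collects a share from every note that links to it and ends up ranked first.</p>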
<p>This week I finally wrote a benchmark. The graph wasn&rsquo;t helping. It was the
single biggest thing <em>hurting</em> my search.</p>
<h2 id="the-setup">The setup</h2>
<p>30 gold queries against the live vault (63 notes), borrowing the harness shape
from an eval framework I&rsquo;d been reading. Each query has a hand-labelled &ldquo;correct&rdquo;
note. I measured recall@5 (did the right note land in the top 5?) and MRR (mean
reciprocal rank: how high did it rank?), across three retrievers:</p>
<ul>
<li><strong>grep</strong> — naive substring term-count. The dumb floor.</li>
<li><strong>bm25</strong> — pure keyword ranking, FTS5&rsquo;s BM25. The honest baseline.</li>
<li><strong>live</strong> — my production hybrid (BM25 + vector + graph).</li>
</ul>
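<p>For reference, the two metrics can be sketched in a few lines. This assumes one gold note per query and a retriever that returns a ranked list of note names; the function names are hypothetical, and my actual harness may score things differently.</p>

```python
def recall_at_k(ranked, gold, k=5):
    """1.0 if the gold note appears in the top k results, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0

def reciprocal_rank(ranked, gold):
    """1/rank of the gold note (ranks start at 1), or 0.0 if it is missing."""
    for position, note in enumerate(ranked, start=1):
        if note == gold:
            return 1.0 / position
    return 0.0

def evaluate(retriever, gold_set, k=5):
    """Average both metrics over a list of (query, gold_note) pairs."""
    recalls, rrs = [], []
    for query, gold in gold_set:
        ranked = retriever(query)
        recalls.append(recall_at_k(ranked, gold, k))
        rrs.append(reciprocal_rank(ranked, gold))
    n = len(gold_set)
    return sum(recalls) / n, sum(rrs) / n
```

<p>Any of the three retrievers can be passed in as <code>retriever</code>, as long as it maps a query string to a ranked list.</p>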
<p>I expected a clean staircase: grep at the bottom, bm25 in the middle, my
clever hybrid on top. That&rsquo;s the whole reason you build the clever thing.</p>
<h2 id="the-scorecard">The scorecard</h2>
<table>
	<thead>
			<tr>
					<th>retriever</th>
					<th>recall@5</th>
					<th>MRR</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>grep</td>
					<td>0.467</td>
					<td>0.307</td>
			</tr>
			<tr>
					<td>bm25</td>
					<td><strong>0.950</strong></td>
					<td><strong>0.826</strong></td>
			</tr>
			<tr>
					<td>live (hybrid, <code>w_graph=0.5</code>)</td>
					<td>0.650</td>
					<td>0.520</td>
			</tr>
	</tbody>
</table>
<p>Read that bottom row twice. My production &ldquo;smart&rdquo; search found the right note
<strong>65%</strong> of the time. Plain keyword search found it <strong>95%</strong> of the time. The
hybrid I&rsquo;d been quietly proud of was <em>worse than its own baseline</em> — it broke
<strong>9 of 30 queries that BM25 got right</strong>. BM25 alone whiffed on exactly one.</p>
<p>The clever layer wasn&rsquo;t adding intelligence. It was adding noise, confidently.</p>
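<p>Counting the &ldquo;broke&rdquo; cases is a per-query diff, not a comparison of averages. A sketch, assuming each retriever&rsquo;s results have been reduced to a hit-or-miss flag per query; the query ids and the function name are made up:</p>

```python
def regressions(baseline_hits, candidate_hits):
    """Queries the baseline got into the top k that the candidate missed."""
    return sorted(
        query
        for query, hit in baseline_hits.items()
        if hit and not candidate_hits.get(query, False)
    )

# Hypothetical per-query results for recall@5.
bm25_hits = {"q01": True, "q02": True, "q03": False}
live_hits = {"q01": True, "q02": False, "q03": False}

broken = regressions(bm25_hits, live_hits)  # ['q02']
```

<p>Listing the broken queries by name is what makes the next step possible: you can open each one and see what displaced the right note.</p>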
<h2 id="why-the-graph-backfired">Why the graph backfired</h2>
<p>Here&rsquo;s the mechanism, and it&rsquo;s almost funny once you see it.</p>
<p>Graph expansion pulls in a matched note&rsquo;s neighbours. But in a real knowledge
base, the most <em>connected</em> notes are hubs — my inbox of ideas, my project radar,
my &ldquo;things Claude noticed&rdquo; log. Everything links to them, so they&rsquo;re everyone&rsquo;s
neighbour. When I searched for something specific, the graph helpfully dragged
these popularity-contest winners into the candidate set, and they elbowed the
genuinely relevant note clean out of the top 5.</p>
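<p>One way to see the hubs is to count inbound links per note. A sketch over a hypothetical link map; the note names are invented, and the real vault&rsquo;s link format isn&rsquo;t shown here:</p>

```python
from collections import Counter

def inbound_link_counts(links):
    """Count how many distinct notes link to each note (its in-degree)."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # ignore repeated links within one note
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical vault: every note links to the inbox, two link to the radar.
links = {
    "owner-note": ["inbox"],
    "setup-guide": ["inbox", "project-radar"],
    "search-notes": ["inbox", "project-radar"],
    "inbox": [],
}

hubs = inbound_link_counts(links).most_common(2)  # [('inbox', 3), ('project-radar', 2)]
```

<p>A note with a high inbound count is a neighbour of almost any match, so neighbour expansion will keep offering it no matter what the query was.</p>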
<p>Concrete example. Query: <em>&ldquo;who owns this knowledge system?&rdquo;</em> The correct answer
is my personal note. BM25 ranked it #5 — just barely in. The hybrid, drunk on
graph neighbours, pushed it off the list entirely. The graph didn&rsquo;t find a
better answer. It buried a good one under hubs.</p>
<p>I swept the graph weight to confirm it wasn&rsquo;t a fluke. The trend was
monotonic: MRR fell at <strong>every</strong> increment of graph weight, and recall@5 never
improved:</p>
<table>
	<thead>
			<tr>
					<th>graph weight</th>
					<th>recall@5</th>
					<th>MRR</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>0.0 (off)</td>
					<td>0.950</td>
					<td>0.826</td>
			</tr>
			<tr>
					<td>0.1</td>
					<td>0.950</td>
					<td>0.737</td>
			</tr>
			<tr>
					<td>0.25</td>
					<td>0.817</td>
					<td>0.564</td>
			</tr>
			<tr>
					<td>0.5 (what I shipped)</td>
					<td>0.650</td>
					<td>0.520</td>
			</tr>
	</tbody>
</table>
<p>There&rsquo;s no ambiguity to argue with. More graph, more harm, no exceptions. The
value I&rsquo;d been <em>claiming</em> in that confident note — I finally measured it, and
it was negative.</p>
<h2 id="the-fix-and-the-actual-lesson">The fix, and the actual lesson</h2>
<p>The fix was one line: drop the default graph weight from <code>0.5</code> to <code>0.1</code>. Recall
snapped back to 0.95, tying pure BM25. (Turning the graph fully off is better
still on MRR, 0.826 against 0.737; I kept a whisper of it as a tiebreaker, which
is a taste call, not a data-driven one.)</p>
<p>But the one-line fix isn&rsquo;t the point. The point is <em>where graphs belong</em>.</p>
<p>Graph expansion isn&rsquo;t a bad idea — I aimed it at the wrong job. <strong>Precision
retrieval</strong> (&ldquo;find me the one note that answers this&rdquo;) wants to be narrow and
literal. Pulling in neighbours is the opposite of what you want; every neighbour
is a chance to be wrong. But I have a <em>different</em> feature in this same system —
a discovery mode that deliberately collides distant notes to surface unexpected
connections. There, neighbour-pulling isn&rsquo;t noise, it&rsquo;s the entire product.</p>
<p>Same mechanism. One context it&rsquo;s poison, the other it&rsquo;s the point. I&rsquo;d been
running my discovery tool inside my lookup tool and calling it a hybrid.</p>
<p>A few honest caveats, because a benchmark you can&rsquo;t poke holes in is usually
lying: my gold set is self-authored v1, the corpus is small (63 notes), and the
vector signal was actually <em>dark</em> during this run — I hadn&rsquo;t built the
embeddings yet, so &ldquo;hybrid&rdquo; here was really &ldquo;BM25 + graph.&rdquo; The vector half of
my original claim is still untested. This is directional, not gospel.</p>
<p>But directional was enough. I&rsquo;d shipped a claim, the claim got measured, and it
didn&rsquo;t survive contact with 30 queries. That&rsquo;s the whole reason I <a href="/posts/gitops-for-my-brain/">keep my
brain in git with everything reproducible</a>: so the
day I bother to measure, the measurement can actually win the argument against
my own confident prose.</p>
<p>The slide-deck version of RAG says <em>add a graph</em>. The benchmark says <em>know which
question you&rsquo;re answering first</em>. I&rsquo;ll take the benchmark.</p>
]]></content:encoded></item></channel></rss>