I have a note in my second brain that I wrote months ago. It says, with the confidence of someone who hadn’t measured anything:

Combining lexical search (BM25) with vector similarity and graph expansion produces more robust recall than embeddings alone.

That sentence shipped into production. My vault of markdown notes gets indexed into a search database, and the search function fuses three signals: BM25 (classic keyword ranking), vector similarity (embeddings), and graph expansion — when a note matches, pull in its linked neighbours too, on the theory that the thing you want is often next to the thing you typed.

It sounds right. Graphs are having a moment in RAG. “Add a knowledge graph to your retrieval” is the kind of thing you can put on a slide and nobody pushes back. I believed it enough to make graph expansion a first-class signal with a weight of 0.5 — equal footing with keyword matching.

This week I finally wrote a benchmark. The graph wasn’t helping. It was the single biggest thing hurting my search.

The setup

30 gold queries against the live vault (63 notes), borrowing the harness shape from an eval framework I’d been reading. Each query has a hand-labelled “correct” note. I measured recall@5 (did the right note land in the top 5?) and MRR (how high did it rank?), across three retrievers:

  • grep — naive substring term-count. The dumb floor.
  • bm25 — pure keyword ranking, FTS5’s BM25. The honest baseline.
  • live — my production hybrid (BM25 + vector + graph).

I expected a clean staircase: grep at the bottom, bm25 in the middle, my clever hybrid on top. That’s the whole reason you build the clever thing.

The scorecard

retrieverrecall@5MRR
grep0.4670.307
bm250.9500.826
live (hybrid, w_graph=0.5)0.6500.520

Read that bottom row twice. My production “smart” search found the right note 65% of the time. Plain keyword search found it 95% of the time. The hybrid I’d been quietly proud of was worse than its own baseline — it broke 9 of 30 queries that BM25 got right. BM25 alone whiffed on exactly one.

The clever layer wasn’t adding intelligence. It was adding noise, confidently.

Why the graph backfired

Here’s the mechanism, and it’s almost funny once you see it.

Graph expansion pulls in a matched note’s neighbours. But in a real knowledge base, the most connected notes are hubs — my inbox of ideas, my project radar, my “things Claude noticed” log. Everything links to them, so they’re everyone’s neighbour. When I searched for something specific, the graph helpfully dragged these popularity-contest winners into the candidate set, and they elbowed the genuinely relevant note clean out of the top 5.

Concrete example. Query: “who owns this knowledge system?” The correct answer is my personal note. BM25 ranked it #5 — just barely in. The hybrid, drunk on graph neighbours, pushed it off the list entirely. The graph didn’t find a better answer. It buried a good one under hubs.

I swept the graph weight to confirm it wasn’t a fluke. It was perfectly monotonic — every increment of graph made search worse:

graph weightrecall@5MRR
0.0 (off)0.9500.826
0.10.9500.737
0.250.8170.564
0.5 (what I shipped)0.6500.520

There’s no ambiguity to argue with. More graph, more harm, no exceptions. The value I’d been claiming in that confident note — I finally measured it, and it was negative.

The fix, and the actual lesson

The fix was one line: drop the default graph weight from 0.5 to 0.1. Recall snapped back to 0.95, tying pure BM25. (Turning the graph fully off is marginally better still on MRR; I kept a whisper of it as a tiebreaker, which is a taste call, not a data-driven one.)

But the one-line fix isn’t the point. The point is where graphs belong.

Graph expansion isn’t a bad idea — I aimed it at the wrong job. Precision retrieval (“find me the one note that answers this”) wants to be narrow and literal. Pulling in neighbours is the opposite of what you want; every neighbour is a chance to be wrong. But I have a different feature in this same system — a discovery mode that deliberately collides distant notes to surface unexpected connections. There, neighbour-pulling isn’t noise, it’s the entire product.

Same mechanism. One context it’s poison, the other it’s the point. I’d been running my discovery tool inside my lookup tool and calling it a hybrid.

A few honest caveats, because a benchmark you can’t poke holes in is usually lying: my gold set is self-authored v1, the corpus is small (63 notes), and the vector signal was actually dark during this run — I hadn’t built the embeddings yet, so “hybrid” here was really “BM25 + graph.” The vector half of my original claim is still untested. This is directional, not gospel.

But directional was enough. I’d shipped a claim, the claim got measured, and it didn’t survive contact with 30 queries. That’s the whole reason I keep my brain in git with everything reproducible: so the day I bother to measure, the measurement can actually win the argument against my own confident prose.

The slide-deck version of RAG says add a graph. The benchmark says know which question you’re answering first. I’ll take the benchmark.