Tech Blog

A RAG with only vector search "forgets when it matters most" — taking recall from 0.2 to 1.0 with hybrid search + measurement

RAG ChromaDB Ollama hybrid search BM25 reranker measurement-driven recall Claude Code personal knowledge management GPU

🔗 Series Table of Contents: This article is the Measurement-Driven Tuning Edition of the “AI Assistant Operations Notes” series. Reading the Design Philosophy Edition first makes it clear why the structure is the way it is.

What you’ll learn

  • The real nature of the problem where a RAG you “designed as a companion” doesn’t remember when you actually use it
  • How to tell that it’s weak retrieval, not a design flaw (so you don’t fall into the redesign trap)
  • The “recovery of un-ingested data” you should do before polishing search
  • How to fix search while measuring recall with a golden set (no guessing)
  • How recall moved with hybrid search, diversification, and priority-bias correction
  • The value of deciding to “reject” by measurement (the courage to drop a change once you’ve measured that it makes things worse)
  • The “lexical gap” that remains at the end, the GPU story, and the punchline
  • How improving search turned out to be continuous with data ingestion, evaluation design, runtime architecture, and infrastructure (GPU distribution) design

Who this is for

  • People who actually use a personal RAG / PKM and feel it just doesn’t retrieve well
  • People who built a RAG with “vector search for now” and left it that way
  • People who have fallen into a swamp by tuning search on gut feeling
  • People who drew up a plan to distribute RAG processing but stalled because the hardware that mattered wouldn’t run (that’s me this time)

Introduction: My companion went silent at the crucial moment

Last time, I arrived at the design principles for turning my homemade personal RAG into a “companion you can live with for 5 years”: keep raw notes without throwing them away, treat distillation as reinforcement rather than replacement, and an Activation layer that injects relevant memories every turn.

🔗 Previous article: The night when AI corrected me 5 times — Design philosophy to make personal RAG your companion for 5 years

I was happy with the design.

But as I used it every day, one day I thought this:

“You don’t remember when it matters, do you?”

Things I had definitely talked about, lessons I had definitely recorded. In the very moment I wanted them pulled into the current context, my companion would casually serve up something else. Or it would only bring up recent things and forget an important conclusion from a little while back.

The design was supposed to be good, yet the experience was a “companion with a poor memory.”

Here is a mistake many people make: assuming “the design is bad” and starting to rebuild. In the previous article, my AI and I stepped into the “redesign trap” over and over. So this time, I questioned that assumption first.

Isn’t this weak retrieval (search), not an architectural flaw?

The two-layer structure, Capture-first, Activation — the skeleton was sound, exactly as I’d hammered it out last time. What needs fixing is search quality, not a redesign. This separation was the starting point this time.

When something goes wrong, identify “which layer the problem is in” before rebuilding. Whether it’s a personal RAG or a large system you’re entrusted with at work, you separate the layers first and then act — that order doesn’t change.


Act 1: Before polishing search — suspect “it was never even in there”

Just as I was fired up to improve search, I noticed an unpleasant possibility.

Maybe it can’t be retrieved not because search is bad, but because the data was never ingested in the first place?

I checked, and bingo. The pipeline that ingested conversations was quietly truncating the second half of long messages. On top of that, when the AI answered in several parts with tool calls in between, everything from the second response onward was dropped entirely.

In other words, the denser a past discussion was, the more its crucial part simply didn’t exist in the index. “It doesn’t come up in search” was only natural — there was nothing there to begin with.

What was happening during ingestion (illustration)

Long message:  ┌──────────────────────────────────────────┐
               │ ●●●●●●●●  the core of the design is here →│
               └──────────────────────────────────────────┘
  Before fix:  [●●●●●●●●] ✂   ← truncated here; the second half vanishes
  After fix:   [●●●●][●●●●]    ← split it up and index everything

The AI sometimes answers in pieces: "reply → tool run → continued reply":

  Real user message
   ├ AI reply (1)
   ├ tool run (search or code) → its result
   ├ AI reply (2)   ← continuation after the result
   └ AI reply (3)   ← further continuation
  Next real user message     ← "one answer" ends here

  Before fix: ingested only reply (1); dropped (2)(3)
              (mistook the tool result for "the next user message" and cut off)
  After fix:  concatenate and ingest (1)+(2)+(3) up to the next real user message

It’s unglamorous, but it was a decisive lesson.

No matter how much you polish search, you can’t retrieve what isn’t in the source data. Suspect capture before retrieval.

I fixed the boundary detection in ingestion, stopped the truncation, and re-ingested every conversation. The lost past exchanges visibly came back. The search story starts in earnest from here.
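
As a rough sketch of the two fixes above (not the actual ingestion code): one function joins every AI reply up to the next real user message, and another splits long text into overlapping chunks instead of truncating it. The event shape, the role labels `user` / `assistant` / `tool`, and the chunk sizes are all assumptions made for illustration. In real logs a tool result may not be labeled this cleanly, which is exactly what the boundary check has to handle.

```python
def group_turns(events):
    """Join all assistant replies that belong to one real user message."""
    turns, current = [], None
    for ev in events:
        if ev["role"] == "user":              # only a real user message opens a new turn
            if current is not None:
                turns.append(current)
            current = {"user": ev["text"], "assistant": ""}
        elif ev["role"] == "assistant" and current is not None:
            current["assistant"] += ev["text"] + "\n"
        # role == "tool": a tool result is not a turn boundary, so it is skipped
    if current is not None:
        turns.append(current)
    return turns


def split_chunks(text, size=800, overlap=100):
    """Split long text into overlapping chunks so the second half is never lost."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):         # the last chunk reached the end
            break
        start += size - overlap
    return chunks
```

Every character of the input lands in at least one chunk, and the overlap keeps a sentence that straddles a boundary searchable from both sides.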


Act 2: Don’t fix by gut — fix “while measuring” with a golden set

This is the part I most want to convey this time.

Improving search, if you do it by gut, will always drag you into a swamp. “It feels somewhat better” is usually a placebo, or it’s breaking something else.

So I built a small golden set myself: a list of correct pairs saying “for this query, this record should come up.” Then, on every change, I measure recall@N (the fraction of golden-set queries whose expected record appears in the top N). I look at before/after as numbers. This alone changes the world.

The first score I measured for plain vector search was, honestly, terrible. It could pull semantically close records, but it dropped them as soon as the phrasing differed slightly. recall@10 was 0.2: for only 2 out of every 10 golden-set queries did the expected record appear anywhere in the top 10.

The moves were textbook RAG, but I added them one at a time and measured one at a time.

Search pipeline (final-form illustration)

  Query
   │
   ├─▶ Vector search (semantically close) ──┐
   │                                        ├─▶ Fuse ranks (RRF) ─▶ Diversify ─▶ Priority adjust ─▶ Result
   └─▶ Lexical search BM25 (surface-close) ─┘
        │
        └ Japanese inflection variation is caught by breaking text into 1–2 character units

  • Hybrid search: run semantic vector search alongside surface-level lexical matching (BM25-style), and fuse the two rankings. I made it so Japanese inflection variation (e.g., 「遅い」osoi “slow” and 「遅すぎ」ososugi “too slow”) is caught on the surface side. This was the most effective.
  • Diversification: so the top isn’t filled with “nearly identical content,” mix in a fixed quota of slightly distant candidates too. This is to protect, on the search side, “the design that tells you about the base of Mt. Fuji” that I wrote about in the previous article.
  • Priority-bias correction: I had been pinning records marked “absolutely never forget” to the very top, but as those grow, they push aside records that are truly close to the current query. I stopped the pinning and balanced it to merely nudge the rank up a bit.
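
The diversification step can be sketched with MMR (maximal marginal relevance), which the roadmap at the end of this article names. This is a minimal illustration, not the production code: the candidate shape, the bigram Jaccard similarity used as a stand-in for embedding similarity, and the value `lam=0.7` are assumptions. It also differs slightly from the "fixed quota" wording in the bullet above, since MMR trades relevance against redundancy on every pick.

```python
def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0


def mmr(candidates, k=3, lam=0.7):
    """candidates: list of (doc_id, relevance, text). Returns up to k diversified ids."""
    rel = {d: r for d, r, _ in candidates}
    grams = {d: bigrams(t) for d, _, t in candidates}
    selected, remaining = [], list(rel)

    def score(d):
        # Penalize a candidate by its similarity to the closest already-picked record
        redundancy = max((jaccard(grams[d], grams[s]) for s in selected), default=0.0)
        return lam * rel[d] - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [
    ("a", 0.90, "hybrid search with bm25"),
    ("b", 0.85, "hybrid search with bm25!"),   # near-duplicate of "a"
    ("c", 0.60, "gpu cooler replacement"),
]
print(mmr(candidates, k=2))            # ['a', 'c']
print(mmr(candidates, k=2, lam=1.0))   # ['a', 'b'] (pure relevance, no diversification)
```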

To make the “catch on the surface” side for Japanese a bit more concrete: rather than looking at words as-is, you break the text into sequences of 1–2 characters (n-grams) and then match. This way, even when the inflection changes, the surface partially overlaps and BM25 can catch it.

# Catch Japanese inflection variation on the surface side: break text into 1–2 grams, feed to BM25
def char_ngrams(text, n_min=1, n_max=2):
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams
# 「遅い」  (osoi)    → 遅, い, 遅い
# 「遅すぎ」(ososugi) → 遅, す, ぎ, 遅す, すぎ
# The shared 「遅」 brushes the surface, so a hit lands even when the inflection differs
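
To show how those n-grams could feed the lexical side, here is a small pure-Python BM25 scorer. It is a sketch under assumptions: the article only says "BM25-style," so the exact formula, the parameters `k1=1.5` and `b=0.75`, and the sample records are illustrative rather than the actual setup.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    # Same 1-2 character split as above, repeated so this block runs on its own
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """docs: {id: text}. Returns ids with a nonzero BM25 score, best first."""
    tf = {d: Counter(char_ngrams(t)) for d, t in docs.items()}
    length = {d: sum(c.values()) for d, c in tf.items()}
    avg_len = sum(length.values()) / len(docs)
    df = Counter(term for c in tf.values() for term in c)   # document frequency
    scores = {}
    for d, counts in tf.items():
        score = 0.0
        for term in set(char_ngrams(query)):
            f = counts.get(term, 0)
            if f == 0:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * length[d] / avg_len))
        if score > 0:
            scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {"d1": "検索が遅すぎる", "d2": "冷却ファンを交換"}
print(bm25_rank("遅い", docs))   # ['d1']
```

Single-character grams such as 「い」 appear in many records, so they match often. The IDF term is what keeps those common grams from dominating the score.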

As a result, the golden set’s recall@10 rose from 0.2 to 1.0. Crowding (the phenomenon where flagged records occupy the top) also improved greatly in measurement.
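
The priority-bias correction can be pictured as the difference between the two functions below. This is a hedged sketch: `scores` stands for fused scores (id to score), and the `boost=1.1` factor is a made-up value, not the tuned one.

```python
def pin_to_top(scores, flagged):
    """Old behavior: flagged records always come first, however far from the query."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d in flagged] + [d for d in ranked if d not in flagged]


def nudge(scores, flagged, boost=1.1):
    """New behavior: flagged records get a small multiplier and then compete normally."""
    adjusted = {d: s * boost if d in flagged else s for d, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)


scores = {"close": 0.030, "near": 0.028, "flagged_far": 0.010, "flagged_close": 0.029}
flagged = {"flagged_far", "flagged_close"}

print(pin_to_top(scores, flagged))  # ['flagged_close', 'flagged_far', 'close', 'near']
print(nudge(scores, flagged))       # ['flagged_close', 'close', 'near', 'flagged_far']
```

With pinning, `flagged_far` pushes two genuinely close records down. With the nudge, a flag only breaks near-ties.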

A little about the internals. The key to fusion is to mix by “rank,” not by absolute score (distance and BM25 scores are on different scales and can’t be added, but ranks can).

# RRF: fuse two rankings by "rank" (simplified; values are examples)
def rrf(rankings, k=60):                 # k is a stabilizing constant
    score = {}
    for ranking in rankings:             # ranking = a list of ids in top-down order
        for rank, doc_id in enumerate(ranking):
            score[doc_id] = score.get(doc_id, 0) + 1 / (k + rank + 1)
    return score                         # id -> score (larger is higher)
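
A small worked example makes the effect of rank fusion visible. The two rankings below are invented, and `rrf` is repeated only so the block runs on its own.

```python
def rrf(rankings, k=60):
    score = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            score[doc_id] = score.get(doc_id, 0) + 1 / (k + rank + 1)
    return score


vector_ranking = ["a", "b", "c"]   # order from vector search (example)
bm25_ranking = ["b", "d", "a"]     # order from lexical search (example)

fused = rrf([vector_ranking, bm25_ranking])
final = sorted(fused, key=fused.get, reverse=True)
print(final)   # ['b', 'a', 'd', 'c']
```

Here "b" scores 1/62 + 1/61 and "a" scores 1/61 + 1/63, so the record that both searches rank highly ends up first, while records seen by only one search ("d", "c") fall behind.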

And here is the core of “don’t do it by gut.” On every change, run the same golden set and look at only the difference in the numbers.

# Measure recall@N with a golden set (simplified)
def recall_at_n(golden, search, n=10):
    hit = 0
    for query, expected_id in golden:    # (query, the record that should come up)
        top = [doc.id for doc in search(query)][:n]
        hit += 1 if expected_id in top else 0
    return hit / len(golden)

print(recall_at_n(golden, baseline_search))  # e.g. 0.2  (plain vector search)
print(recall_at_n(golden, hybrid_search))    # e.g. 1.0  (hybrid + diversification)

※ The code shown is a simplified version that conveys the concept; the actual parameters and structure are a bit more involved. The two key points are “fuse by rank” and “compare by number.”
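
One possible extension of "compare by number" is to report which queries changed, not only the overall score. The sketch below assumes each search function returns a list of record ids (a simplification of the `doc.id` objects above), and `compare` is a hypothetical helper name.

```python
def hits(golden, search, n=10):
    """Queries whose expected record appears in the top n results."""
    return {query for query, expected_id in golden if expected_id in search(query)[:n]}


def compare(golden, before, after, n=10):
    b, a = hits(golden, before, n), hits(golden, after, n)
    return {
        "recall_before": len(b) / len(golden),
        "recall_after": len(a) / len(golden),
        "fixed": sorted(a - b),        # newly retrieved by the change
        "regressed": sorted(b - a),    # broken by the change
    }


golden = [("q1", "r1"), ("q2", "r2"), ("q3", "r3")]
before = lambda q: {"q1": ["r1"], "q2": ["x"], "q3": ["r3"]}[q]
after = lambda q: {"q1": ["r1"], "q2": ["r2"], "q3": ["y"]}[q]

report = compare(golden, before, after)
print(report["fixed"], report["regressed"])   # ['q2'] ['q3']
```

In this toy case recall is 2/3 both before and after, yet one query was fixed and another broke. The per-query diff is what exposes a change that "feels better" while breaking something else.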

Insight: the value of search improvement is in the “measured difference,” not the “move you made.” Fixing while watching recall makes effective and ineffective moves obvious at a glance.

With numbers, the discussion shifts from “preference” to “fact.” This was the same in design reviews too.


Act 3: The courage to decide “reject” by measurement

The tastiest part of being measurement-driven is that you can confirm a bad move is bad.

One move I had hopes for was “query expansion.” Before searching, use an LLM to rephrase the query into different wordings and search with several variations. It should be more robust to lexical mismatch — that was the plan.

I tried it and measured. It got worse. The rephrasing subtly shifted the meaning (drift), causing it to drop correct answers instead. And it was slow. With a lightweight local LLM, neither quality nor speed was worth it.

The other move, “switch to a stronger embedding model,” I actually built and ran. But on my local CPU, 4 out of 5 cases failed to process, and even the ones that passed took several seconds each. With that cost riding on every search, it wasn’t realistic for a companion used in real time.

I rejected both after measuring. The code remains, clearly labeled “measured net-negative, not wired to production” (this kind of failure record is precisely what’s worth looking back on later).

Here, one operational principle solidified. It’s not about whether a feature is good or bad; it’s a judgment to protect speed and experience. In other words, it’s a design decision about non-functional requirements.

Insight: don’t put heavy processing on the path of search (reads) that runs in real time. Heavy things go to the batch behind the scenes. Break this and even if quality rises, the experience dies.

On which path do you put heavy processing?

  ┌─ Live path (read / real time) ───────────────────────────┐
  │  Query → light search → return fast  ← no heavy work here │
  └──────────────────────────────────────────────────────────┘
  ┌─ Batch path (write / behind the scenes) ─────────────────┐
  │  ingestion / distillation / heavy preprocessing  ← OK to take time │
  └──────────────────────────────────────────────────────────┘

  Both query expansion (LLM rephrasing) and heavy embedding didn't fit the live path → rejected
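
The live-path rule can also be turned into a number. Here is a minimal sketch, assuming a latency budget of 300 ms (a made-up value) and a `search` callable that takes a query string. It measures the 95th-percentile latency over the golden-set queries and answers yes or no.

```python
import math
import time


def p95_latency_ms(search, queries, repeats=3):
    """Nearest-rank 95th percentile of wall-clock latency, in milliseconds."""
    samples = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            search(q)
            samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[math.ceil(0.95 * len(samples)) - 1]


def fits_live_path(search, queries, budget_ms=300):
    """True if the search is fast enough to sit on the real-time read path."""
    return p95_latency_ms(search, queries) <= budget_ms
```

A move that raises recall but fails this check would belong on the batch path, or nowhere, which is the same call made above for query expansion and the heavy embedding.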

The reason I can confidently say “I tried it → it got worse → so I dropped it” is that I’m measuring. By gut, “I added something but it feels like it’s working” lingers forever.


Act 4: Same meaning, but different words won’t retrieve — the lexical gap

For normal queries, recall came up to full marks (1.0 on the golden set). But on a deliberately hard golden set, made of cases where the query and the record I want share the same meaning yet share no words at all, recall stubbornly stayed at 0.2. This is what I call the “lexical gap” in this article.

For example:

  • For the query “yaru-ki ga tsuzuku” (staying motivated), I want to pull the record “mochibēshon iji” (sustained motivation)
  • For the query “nakama to chikara wo awaseru” (joining forces with teammates), I want to pull the record “chīmuwāku” (teamwork)

A human instantly knows “they’re saying the same thing.” But in Japanese, the surfaces don’t share a single character (one side is native Japanese words, the other a katakana loanword), and a lightweight embedding can’t reach across either. This was the last wall of a RAG built in a local environment.

The lexical gap: the meaning is the same, yet the words share not one character

  Query:  "yaru-ki ga tsuzuku" (staying motivated)
            │   surface overlap = zero
            ▼
  Record: "mochibēshon iji" (sustained motivation)

  Vector search … should be semantically close, but a light embedding can't reach   ✗
  BM25 (surface) … no shared characters, so it can't pick it up at all              ✗
  ──────────────────────────────────────────────────────────────────────────────
  → only a cross-encoder reranker / a strong multilingual embedding bridges it
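
The "no shared characters" claim can be checked directly with the same 1 to 2 character split. The Japanese strings below are my assumed rendering of the romanized example, so treat them as illustrative.

```python
def char_ngrams(text, n_min=1, n_max=2):
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def surface_overlap(query, record):
    """Character n-grams that the query and the record have in common."""
    return set(char_ngrams(query)) & set(char_ngrams(record))


print(surface_overlap("遅い", "遅すぎ"))                  # {'遅'}
print(surface_overlap("やる気が続く", "モチベーション維持"))  # set()
```

With an empty overlap, every BM25 term contributes zero, so the lexical side cannot rank the record at all, no matter how it is tuned.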

I know the remedies. Reorder the top with a cross-encoder reranker, or switch to a strong multilingual embedding. Both are known to work.

However — both require a high-performance GPU to run at a practical speed. On CPU, as I actually measured in Act 3, each item takes several seconds. With every search waiting several seconds, you can’t put it on the path of a companion that responds in real time.
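
Structurally, a reranker is a second pass that rescores only the few top candidates. The sketch below keeps the model out of the picture: `score_pair` is a hypothetical callable that would wrap a cross-encoder in practice, and the toy scorer here exists only to show the reordering.

```python
def rerank(query, candidates, score_pair, top_k=10):
    """candidates: list of (doc_id, text) in first-pass order.

    Only the top_k head is rescored; the tail keeps its original order,
    which bounds the number of (slow) model calls per search.
    """
    head, tail = candidates[:top_k], candidates[top_k:]
    rescored = sorted(head, key=lambda c: score_pair(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in rescored + tail]


def toy_score(query, text):
    # Stand-in for a cross-encoder: pretends to know the two phrases mean the same thing
    return 1.0 if (query, text) == ("joining forces with teammates", "teamwork") else 0.0


candidates = [("r1", "gpu cooler"), ("r2", "teamwork"), ("r3", "other notes")]
print(rerank("joining forces with teammates", candidates, toy_score, top_k=2))
# ['r2', 'r1', 'r3']
```

The cost is one model call per rescored candidate on every search. That is why the several seconds per item measured on CPU rules it out for the live path.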

And from here, the story rolls in a slightly strange direction.


Act 5: The plan to distribute RAG processing, and the reality of desktops that won’t run

I need a high-performance GPU. But making one machine carry it all is heavy. So why not split the roles across the several GPU machines at home — and with that, I cheerfully began sketching that distributed setup.

Distribute heavy processing like embedding and reranking across the several GPU machines at home, and have the laptop stay strictly an operator terminal. Since only a few KB of query and result flow over the LAN, it won’t be a bottleneck — I even drew the diagram and was rather pleased with myself.

Full of enthusiasm, I pressed the power button on a desktop.

It didn’t run.

The other one was the same. Both desktops, untouched for a long time, wouldn’t boot when powered on — just five beeps. I’m guessing it’s CPU thermals, and I plan to swap in a new closed-loop cooler and PSU, but a firm diagnosis is still pending. Either way, the plan to distribute RAG processing stopped in front of the silly reality that the machines that matter both won’t boot.

Here, for a moment, I pinned my hopes on the laptop at hand.

After all, this laptop has an RTX 2070 in it too. Why not just try it here?

But I quickly reconsidered. This laptop is the very work machine I run VSCode and Claude Code on every day. The role I gave the laptop in the original distributed setup was the side that throws queries and receives results — the operator. Heavy processing is entrusted to the desktop-side GPUs, and the machine at hand is kept light. This isn’t a compromise; it’s a placement I chose deliberately.

If I make my everyday work tool permanently carry an embedding server and reranker, the work itself gets heavy. So to truly take on this wall, I still need to fix the desktops and build the original distribution first. There’s a range I can try on the laptop, but that implementation — incorporating the reranker and so on — is still ahead.

The processing-distribution plan in my head (split roles across several home machines)

  [Machine A: Ollama (embedding + LLM)]
  ┌──────────────────────────────────┐
  │  Ryzen 7 2700X   +   RTX 2060     │
  └──────────────────────────────────┘
   ※ Both embedding and LLM models are light, so 6GB is enough. Chosen for speed over VRAM:
     Turing's RTX 2060 rather than a Pascal GTX
                 │ LAN

  [Machine B: main / ChromaDB (CPU-centric)]
  ┌──────────────────────────────────┐
  │  Core i9-9900K   +   GTX 1080Ti   │
  └──────────────────────────────────┘
   ※ ChromaDB's work is CPU-centric. The idle GPU is given the reranker
     that reorders results to raise precision
                 │ LAN

  [Laptop: client]
  ┌──────────────────────────────────┐
  │  Core i7-9750H   +   RTX 2070     │
  └──────────────────────────────────┘

      ✗ Both desktops won't boot with the same symptom


  Distribution is on hold. For now, I work within what I can try on the laptop (the work machine).
  The real distribution comes after the desktops are fixed (implementation still ahead)

Just to add: this placement is not a thrown-together hunch. Embedding and LLM models are light, so they go to the RTX 2060 — fast on Turing even at 6GB. ChromaDB’s main work is CPU-centric, so it goes to the 1080Ti machine, where the spare GPU can host the reranker. The laptop stays strictly the operator, and only a few KB of query and result flow over the LAN — a genuine design decision that matched roles to each piece of hardware’s strengths and weaknesses. What started as fixing RAG search had, before I knew it, become infrastructure design about where to place the network and the hardware.

And this way of thinking, “split heavy roles and put each on the hardware that’s good at it,” pays off more the more performance you demand of AI. As the scale changes, the tooling for scaling and orchestration changes, but the design thinking carries straight over: the judgment I worked through at my desk is continuous with how a large system distributes its processing.

Even its first step — abstracting the embedding’s connection target with an environment variable — can be designed without waiting for the hardware to recover. If you don’t hardcode the target, the “same code” works whether on one machine or distributed; you just switch where it points (implementation still ahead).

# Abstract the embedding's connection target with an environment variable (design; to be implemented)
# With the "same code," one machine or distributed, you only swap where it points
import os

OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
# Laptop alone → localhost
# Distributed  → just point it at http://<RTX 2060 machine>:11434
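
Going one step further, here is a sketch of how a client might use that variable. It assumes Ollama's `/api/embeddings` endpoint (request fields `model` and `prompt`, response field `embedding`) and the model name `nomic-embed-text`. Both should be checked against the Ollama version actually in use, and none of this is implemented in the article's system yet.

```python
import json
import os
import urllib.request

OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


def build_embed_request(text, model="nomic-embed-text", base_url=OLLAMA_BASE_URL):
    """Build the HTTP request; the target host comes only from base_url."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, **kwargs):
    """Send the request and return the embedding vector (needs a running server)."""
    with urllib.request.urlopen(build_embed_request(text, **kwargs), timeout=30) as resp:
        return json.load(resp)["embedding"]
```

Because the host appears nowhere else in the code, moving the embedding work to another machine is a matter of setting `OLLAMA_BASE_URL` before starting the process.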

Bringing together PCs a few generations old, splitting roles, and running RAG efficiently: that processing-distribution plan is shelved for now. The desktop PCs should run once I swap parts, so once they’re fixed I plan to build the original distributed setup.


Aside: why did I have GPUs?

Lastly, a little about the environment.

Local RAG tuning using a GPU is powerful. Without sending anything to the cloud, you can measure and iterate at hand any number of times. Acts 2–4 were possible because there was a decent local GPU.

These GPUs weren’t bought for RAG. I love cars and built a serious racing simulator, and being a fan of new things, I even reached for PC-connected VR (Oculus Rift 2) — they were stacked up for hobbies like that, and they just happened to be useful for this tuning. Neither runs pleasantly without a decent GPU.

If you have a gaming PC sleeping at hand, that’s a fine experimental setup. With a GPU, local LLM/RAG can be tried casually, any number of times. In fact, the Act 2–4 tuning ran on that one laptop.


Conclusion: the quiet strength of fixing by measuring

There was nothing flashy here. Lining up what I did, it’s this:

  • Before search, I suspected ingestion (you can’t retrieve what isn’t in the source)
  • I built a golden set and fixed while measuring recall (no guessing)
  • I added effective moves and measured-then-dropped ineffective ones (the courage to confirm rejection)
  • I didn’t put heavy processing on the real-time path (don’t kill the experience for quality)
  • In the end I hit the GPU wall and sketched the distribution I really need, but the desktops that matter wouldn’t boot, so I stalled (with a punchline)

These, I think, are principles that don’t age no matter what the specific tool is. Even if ChromaDB changes to another vector DB, or Ollama to another runtime, “fix while measuring,” “don’t put heavy things on the live path,” “assign heavy processing to the right hardware and abstract the connection target so the same code runs on one machine or distributed,” and “do as much as you can at hand, and when that’s not enough, scale up honestly” will probably keep working for a long time.

A companion isn’t smart from the start. Measure, fix, and measure again. Only that unglamorous back-and-forth slowly turns a forgetful companion into one that “remembers for you.”

The “lexical gap” that remained at the end is still uncrossed. To cross it, I need a distributed setup that properly runs a strong embedding and a reranker. That comes after I fix the desktops and build it. For now I’ll work from the range I can try on the laptop, but the real thing comes after — that’s my honest current position.

Roadmap — current position

  ✅ Done
     ├ Fixed ingestion gaps (D1/D5) → re-ingested all conversations
     ├ Hybrid search (vector + BM25 + RRF)
     ├ Diversification (MMR) / priority-bias correction
     ├ Measured recall with a golden set (0.2 → 1.0)
     └ Implemented query expansion / strong embedding (CPU) → measured → rejected

  ⬜ Not done (what's left to cross the lexical gap)
     ├ Make the embedding connection target an env var (prep to point at a GPU)
     ├ Incorporate the reranker (cross-encoder)
     ├ Re-measure strong embedding + reranker on GPU → cross the lexical gap
     └ The real distributed setup (repair desktops → split roles)

It’s not “build it and you’re done”

A RAG is not “tune it once and it’s complete.” When you run into a scene where it doesn’t retrieve well, first investigate what’s happening and analyze the cause. Then patch the place that needs fixing — sometimes stepping into the design itself — and measure recall again.

Keeping this investigate → analyze → improve loop turning without stopping is, I think, what it really means to raise a companion. This lexical gap, too, is just one stop along the way. Cross one and the next wall comes into view. When it does, I’ll face it the same way.
