Local RAG over markdown notes with sqlite-vec and Ollama

A few months ago I wrote about a notes system I built so an AI assistant could read my notes the same way I do. Date-grouped markdown files, every note starts with a ## TL;DR, all version controlled. It worked. The notes kept piling up.

Then I noticed something annoying.

A while back I’d debugged an absence sync issue with one of our third-party school data integrations. Painful one. Took the better part of a day to track down, lots of dead ends. I wrote up the whole thing in a note when it was finally fixed. Months later the same shape of bug came back. I sat there debugging with Claude for an hour, going through the usual suspects, before some part of my brain finally went “wait, haven’t we done this.” The note was right there in ~/Workspace/notes/. I just hadn’t told Claude to look at it. The premise of the original system was that I’d point the assistant at the relevant note. But for that to happen, I have to remember the note exists.

The fix is RAG, or retrieval-augmented generation. Every time a question comes up, search the corpus, find anything relevant, and hand it to the assistant as context. The catch is that many implementations push your data to a cloud vector database and a hosted embeddings API, and I wasn’t going to do that with notes full of client work, PR reviews and half-baked ideas.

So local. One user, one laptop, no network calls.

The pieces

Four things make this work.

SQLite with the sqlite-vec extension for vector search. Ollama running nomic-embed-text locally to turn text into numbers (an “embedding” is just a list of numbers that captures the meaning of a piece of text, so two pieces of text about similar things end up with numerically similar lists). A bit of TypeScript glue. A Claude Code hook and a slash command to make it invisible during normal use.

The code is at github.com/gayanhewa/notes-rag, about 450 lines of TypeScript.

The bit that made this work

Choosing what to feed the embedding model mattered more than any of the code.

A lot of RAG tutorials start with chunking. You take a document, split it into 500-token windows, embed every window, store all of them. With my notes that would have been thousands of vectors, each one diluted by whatever else happened to be in its window.

But every note in my system already starts with a ## TL;DR. Two to five bullets of dense summary that I write for myself, every time. It’s the highest-signal text in the whole note.

So that’s all I embed. One vector per note, generated from the TL;DR. When a search happens, I match against TL;DRs. When something matches, I read the full file from disk.
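Pulling the TL;DR out of a note can be sketched roughly like this. It is a minimal illustration, not the repo's parser: `extractTldr` is a hypothetical name, and it assumes the section runs from the `## TL;DR` heading to the next level-1 or level-2 heading.

```typescript
// Hypothetical helper: returns the text under the first "## TL;DR" heading,
// or null when the note has no TL;DR (or the section is empty).
function extractTldr(markdown: string): string | null {
  const lines = markdown.split(/\r?\n/);
  const start = lines.findIndex((line) => /^##\s+TL;DR\s*$/i.test(line));
  if (start === -1) return null;

  const body: string[] = [];
  for (const line of lines.slice(start + 1)) {
    // Assumption: the next "#" or "##" heading ends the TL;DR section.
    if (/^#{1,2}\s/.test(line)) break;
    body.push(line);
  }

  const text = body.join("\n").trim();
  return text.length > 0 ? text : null;
}

const example = "---\ntitle: sync bug\n---\n## TL;DR\n- a\n- b\n\n## Details\nlong body";
console.log(extractTldr(example));
```

Returning null for a missing section lets the indexer decide what to do with notes that skip the convention, rather than silently embedding an empty string.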

The obvious blind spot is that if something important lives only in a note’s body and never made the TL;DR bullets, the search will miss it. I’ve watched for this in practice and it hasn’t bitten me yet, mostly because writing the TL;DR is how I figure out what the note is actually about. The day it bites me I’ll add a second tier of body-level embeddings. Until then, one vector per note is enough.

How it actually runs

When a note is created or edited, the indexer parses the file, pulls out the frontmatter, grabs the ## TL;DR section, hashes the file so it can skip work on unchanged notes, and sends the TL;DR to Ollama. Ollama returns a vector of 768 numbers (the dimension nomic-embed-text produces). That vector goes into a sqlite-vec virtual table. The metadata (path, title, tags, date, TL;DR text, hash) goes into a regular SQLite table linked by ID.
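The hash check and the embedding request might look something like the sketch below. The names (`IndexedNote`, `contentHash`, `needsReindex`, `embed`) are hypothetical, and the choice of SHA-256 is an assumption. The request assumes Ollama is listening on its default local port and uses its `/api/embed` endpoint. The repo may do this differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of what the index remembers about a note.
interface IndexedNote {
  path: string;
  hash: string;
}

// Hash the file contents so an unchanged note can be skipped cheaply.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// A note needs work if it is new or its contents changed since last index.
function needsReindex(fileText: string, existing: IndexedNote | undefined): boolean {
  return existing === undefined || existing.hash !== contentHash(fileText);
}

// Assumption: Ollama is running locally on its default port (11434).
async function embed(text: string, model = "nomic-embed-text:v1.5"): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: text }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);

  const data = (await res.json()) as { embeddings?: number[][] };
  const vector = data.embeddings?.[0];
  if (!vector) throw new Error("Ollama response had no embedding");
  return vector;
}
```

Checking the hash before calling `embed` is what keeps a full reindex cheap: only notes whose contents changed pay for a model call.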

To search, the same model embeds the query, sqlite-vec returns the closest stored vectors, the CLI prints the matches.
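To make "returns the closest stored vectors" concrete, here is a brute-force stand-in in plain TypeScript. sqlite-vec does this work inside SQLite. The sketch only illustrates the idea with toy 3-dimensional vectors, and `StoredNote`, `cosineDistance` and `nearest` are hypothetical names.

```typescript
interface StoredNote {
  path: string;
  embedding: number[];
}

interface Match {
  path: string;
  distance: number;
}

// Cosine distance: 0 when the vectors point the same way, 1 when orthogonal.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector");
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every note against the query and keep the k closest.
function nearest(query: number[], notes: StoredNote[], k: number): Match[] {
  return notes
    .map((note) => ({ path: note.path, distance: cosineDistance(query, note.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const notes: StoredNote[] = [
  { path: "sync-bug.md", embedding: [1, 0, 0] },
  { path: "holiday-plans.md", embedding: [0, 1, 0] },
  { path: "sync-followup.md", embedding: [0.9, 0.1, 0] },
];
console.log(nearest([1, 0, 0], notes, 2).map((m) => m.path));
```

With one vector per note, even a linear scan like this stays cheap for a personal corpus. The virtual table mostly buys you persistence and a SQL interface.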

The CLI has three commands and that’s the whole interface:

npx tsx src/cli.ts index                          # full reindex
npx tsx src/cli.ts index-one <path>               # one file, used by the hook
npx tsx src/cli.ts search "what did I decide..."  # query

Making it invisible

A CLI on its own doesn’t fix the original problem. I’d still have to remember to run it. The whole point was for the assistant to use this without me having to think about it.

Two integrations did that.

A hook to keep the index fresh

Every time I write or edit a file in my notes folder, the index needs to update. If I forget, the system is just stale.

Claude Code has PostToolUse hooks that fire after a tool call. I added one matching Write|Edit (the matcher is a regex over tool names) and pointed it at a shell script:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/notes-rag-index.sh"
          }
        ]
      }
    ]
  }
}

The script itself is mostly bash defensiveness. Read the JSON Claude Code sends on stdin, check the file path is under my notes folder, check it’s a markdown file, and if so kick off a background reindex of just that file:

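#!/bin/bash
# Claude Code passes the hook payload as JSON on stdin.
INPUT=$(cat)
# Path of the file the tool just touched; empty if the field is missing.
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')

# Only react to markdown files inside the notes folder.
NOTES_DIR="$HOME/Workspace/notes"
[[ "$FILE_PATH" != "$NOTES_DIR"/* ]] && exit 0
[[ "$FILE_PATH" != *.md ]] && exit 0

cd "$HOME/Workspace/notes-rag" || exit 0
# Reindex just this file in the background so the hook returns immediately.
NOTES_RAG_MODEL="${NOTES_RAG_MODEL:-nomic-embed-text:v1.5}" \
  npx --no-install tsx src/cli.ts index-one "$FILE_PATH" >/dev/null 2>&1 &
exit 0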

The model tag is pinned because if you ever pull a different version of nomic-embed-text, the new embeddings won’t be comparable to the ones already in the index.

The hook returns immediately. The indexing happens in the background, takes a second, doesn’t block me.

A slash command to query

The other half is letting Claude actually use the search. I wrote a small skill that becomes /notes-recall <query>:

---
name: notes-recall
description: Search past notes in ~/Workspace/notes/ via local RAG. Use when user asks "what did I write about X", "do I have notes on Y", or wants to surface prior notes by meaning rather than filename.
allowed-tools: Bash(npx tsx *), Bash(cd *), Read
---

# Notes Recall

## Workflow

1. Run: cd ~/Workspace/notes-rag && npx tsx src/cli.ts search "<query>"
2. Read the top 1-3 matching files in full
3. Cite the file path inline so the user can open it

If the output says "no matches", say so. Don't invent answers.
If the output starts with "weak signal", treat results as suggestions, not answers.

The repo has the full version, which adds some tuning advice for me. The workflow above is what actually runs.

What I got wrong the first few times

The first thing I got wrong was the similarity threshold. Vector search always returns something. If you ask it for the 5 closest matches, you get 5 results whether or not any of them are relevant. So you set a threshold on distance, where a smaller distance means closer in meaning: any match whose distance is above the threshold gets dropped before it’s shown.

My first threshold was generous (0.55 on the cosine distance scale, where 0 is identical and 1 is unrelated). Results looked great when I knew there was a matching note. The right one always came back at the top. But the moment I asked about something I’d never written about, the system would still return the least-bad note in my corpus, looking just as confident as a real match. That’s the failure mode that kills any RAG system: you start trusting plausible-looking output that’s wrong.

I tightened it to 0.48. Now if there’s no real match, the CLI says “no matches” and Claude tells me there’s no relevant note. Better to inject nothing than wrong context.
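The cut-off itself is a one-line filter. A minimal sketch, assuming the search hands back matches with a cosine distance attached (`Match`, `applyThreshold` and `render` are hypothetical names, not the repo's):

```typescript
interface Match {
  path: string;
  distance: number;
}

// The cut-off described above; the right value depends on your corpus and model.
const MAX_DISTANCE = 0.48;

// Drop anything further away than the cut-off.
function applyThreshold(matches: Match[], maxDistance: number = MAX_DISTANCE): Match[] {
  return matches.filter((m) => m.distance <= maxDistance);
}

// Print surviving matches, or an explicit "no matches" so the assistant
// has something unambiguous to report.
function render(matches: Match[]): string {
  const kept = applyThreshold(matches);
  if (kept.length === 0) return "no matches";
  return kept.map((m) => `${m.distance.toFixed(3)}  ${m.path}`).join("\n");
}

console.log(render([{ path: "unrelated.md", distance: 0.52 }]));
```

Emitting a literal "no matches" string, rather than empty output, gives the skill instructions a fixed phrase to key on.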

The second thing I got wrong was assuming the threshold alone was enough. Sometimes two notes come back at almost identical scores, where neither is a great match but both clear the bar. You’re back to the same problem: the system has found something, but with best and second-best that close together, the ranking tells you nothing about which one to trust.

So I added a check. When the top result’s lead over the second-best is under 0.02 (in the same cosine distance units), the CLI prints a “weak signal” line above the results, and the skill tells Claude to treat those results as suggestions rather than facts. I trust the system far more with this check in place. When Claude says “I’m not sure, but this might be related” instead of “here’s what your notes say,” I believe the output more.
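A rough sketch of that gap check, assuming the matches arrive already sorted by ascending distance. The names and the exact wording of the warning line are hypothetical. The only thing taken from the skill is that the output starts with "weak signal".

```typescript
interface Match {
  path: string;
  distance: number;
}

// Minimum lead the best match must have over the runner-up.
const MIN_LEAD = 0.02;

// Expects matches sorted by ascending distance (best first).
function isWeakSignal(sorted: Match[], minLead: number = MIN_LEAD): boolean {
  if (sorted.length < 2) return false; // nothing to compare against
  return sorted[1].distance - sorted[0].distance < minLead;
}

// Prefix the results with a warning line when the top two are nearly tied.
function formatResults(sorted: Match[]): string {
  const lines = sorted.map((m) => `${m.distance.toFixed(3)}  ${m.path}`);
  if (isWeakSignal(sorted)) {
    lines.unshift("weak signal: top results are nearly tied");
  }
  return lines.join("\n");
}

console.log(
  formatResults([
    { path: "a.md", distance: 0.4 },
    { path: "b.md", distance: 0.41 },
  ]),
);
```

A single result is never flagged here, which is a design choice: with nothing to compare against, the threshold is the only guard left.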

The third thing I got wrong was thinking I’d want retrieval as something Claude calls when it decides it needs it. I’ve come around to the hook approach because it removes a decision. If the assistant has to remember to search, it sometimes won’t. If retrieval is automatic, it happens every time. The cost of a search that returns nothing is basically zero, and skipping a search that would have surfaced a relevant note is exactly the problem I was trying to solve in the first place.

Why local turned out to matter more than I thought

I expected privacy to be the strongest argument for local. It is. The thing I didn’t predict was latency.

Embedding a query against the local Ollama instance takes about 50ms on my laptop. Searching SQLite takes single-digit milliseconds. The whole round trip is fast enough that I don’t notice it. If this went through a hosted embeddings API and a remote vector database, every query would carry a noticeable delay, and the assistant would feel laggier even when the answer was right.

There’s also something I just like about the whole thing being a SQLite file in my home directory. No background service. No subscription. No cloud account I’ll forget about.

What I’d build next

The most obvious next step is wiring search into UserPromptSubmit so every question implicitly searches the notes before Claude sees it. I’ve been holding off until I trusted the threshold behaviour. Now I do, so it’s just plumbing.
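The plumbing could look roughly like the sketch below. It is not in the repo. It assumes Claude Code sends the UserPromptSubmit payload as JSON on stdin with a `prompt` field, and that whatever the hook prints to stdout on a clean exit is added to the context. `contextForPrompt` and `SearchFn` are hypothetical names, and the search function is injected so the logic can be exercised without the real CLI.

```typescript
// Hypothetical: runs the notes search and returns the CLI's text output.
type SearchFn = (query: string) => string;

// Decide what, if anything, the hook should print for a given payload.
function contextForPrompt(stdinJson: string, search: SearchFn): string {
  let prompt = "";
  try {
    const payload = JSON.parse(stdinJson) as { prompt?: unknown };
    if (typeof payload.prompt === "string") prompt = payload.prompt.trim();
  } catch {
    return ""; // malformed payload: stay silent rather than break the prompt
  }
  if (prompt.length === 0) return "";

  const output = search(prompt).trim();
  // Inject nothing when the search found nothing.
  if (output === "" || output.startsWith("no matches")) return "";
  return `Possibly relevant notes (from local search):\n${output}`;
}

// Example with a stubbed search function.
const fakeSearch: SearchFn = () => "0.310  2024/sync-bug.md";
console.log(contextForPrompt(JSON.stringify({ prompt: "absence sync failing" }), fakeSearch));
```

A real hook script would read stdin, pass in a function that shells out to `src/cli.ts search`, and print the returned string. Any "weak signal" line passes through untouched so the assistant still sees it.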

Beyond that, nothing urgent. The system already changed how I use my notes. Last week I started typing a question about an integration thing I half-remembered, and Claude opened the matching note from a couple of months ago before I’d finished writing the sentence. That’s the whole game. I write the notes roughly the same way as before, but I no longer have to remember they exist.

Code is at github.com/gayanhewa/notes-rag if you want to look at it. If you haven’t read the original notes post, start there, because the TL;DR convention is what makes any of this actually work.