Making RAG Accurate in Java
Introduction
RAG From Scratch in Java built a retrieval pipeline out of cosine similarity and a reranking pass, and Vector Databases in Practice for Java moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: does it retrieve the right chunks, and how would you know?
This post answers that question in two halves. First, three techniques that improve what gets retrieved in the first place: hybrid search that catches what pure vector similarity misses, metadata filtering that narrows the search space before ranking even starts, and chunking choices that shape recall long before a query is ever run. Second, the metrics that turn “this feels better” into a number you can track across a change: recall@k, precision@k, MRR, and nDCG. All code below is illustrative and has not been executed; it is consistent with the pipeline built in post 20.
Making RAG Accurate in Python
Introduction
RAG From Scratch in Python built a retrieval pipeline out of cosine similarity and a reranking pass, and Vector Databases in Practice for Python moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: does it retrieve the right chunks, and how would you know?
This post answers that question in two halves. First, three techniques that improve what gets retrieved in the first place: hybrid search that catches what pure vector similarity misses, metadata filtering that narrows the search space before ranking even starts, and chunking choices that shape recall long before a query is ever run. Second, the metrics that turn “this feels better” into a number you can track across a change: recall@k, precision@k, MRR, and nDCG. All code below is illustrative and has not been executed; it is consistent with the pipeline built in post 21.
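To make the four metrics concrete before the details, here is a minimal sketch of how they might be computed for a single query. It assumes binary relevance (a chunk is either relevant or not), unique chunk ids in the retrieved list, and hypothetical function names chosen for illustration. MRR is simply the mean of `reciprocal_rank` over a set of queries.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Recall and precision ignore ordering within the top k, while reciprocal rank and nDCG reward putting relevant chunks earlier. That is why the four are usually tracked together.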
RAG From Scratch in Java
Introduction
Retrieval-augmented generation (RAG) is the pattern behind almost every “chat with your documents” feature: instead of hoping a model already knows the answer, you find the passages most likely to contain it and hand them to the model as context. Done well, it’s the single highest-leverage technique for keeping an LLM application grounded in facts it didn’t memorize.
This post builds RAG from scratch — no vector database, no framework — so the mechanics are visible end to end: splitting documents into chunks, turning them into embeddings, searching them with nothing more than array math, reranking the results, and generating a grounded answer. It builds directly on the grounding discipline from Building Reliable LLM Applications in Java — “give the model the source material and instruct it to answer only from that material” — by showing where that source material actually comes from. We’ll close with the question every RAG design eventually has to answer honestly: when does retrieval beat simply pasting more into the context window?
RAG From Scratch in Python
Introduction
Retrieval-augmented generation (RAG) is the pattern behind almost every “chat with your documents” feature: instead of hoping a model already knows the answer, you find the passages most likely to contain it and hand them to the model as context. Done well, it’s the single highest-leverage technique for keeping an LLM application grounded in facts it didn’t memorize.
This post builds RAG from scratch — no vector database, no framework — so the mechanics are visible end to end: splitting documents into chunks, turning them into embeddings, searching them with nothing more than array math, reranking the results, and generating a grounded answer. It builds directly on the grounding discipline from Building Reliable LLM Applications in Python — “give the model the source material and instruct it to answer only from that material” — by showing where that source material actually comes from. We’ll close with the question every RAG design eventually has to answer honestly: when does retrieval beat simply pasting more into the context window?
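As a taste of the first step, here is a minimal sketch of splitting a document into chunks. It assumes fixed-size character windows with overlap, which is one common choice and not necessarily the strategy the rest of the post settles on. The function name and default sizes are hypothetical.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so we do not
        # emit a trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.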
Vector Databases in Practice for Java
Introduction
RAG From Scratch in Java built retrieval with nothing but an array of doubles and a Comparator: cosine similarity computed in a loop, top-k picked with a stream sort. That post said outright that this is a brute-force O(n) scan — fine for a few thousand chunks, the wrong tool once a corpus reaches millions. This post picks up exactly there: how do you store and search vectors at that scale, using Postgres, and when do you need something else entirely?
Vector Databases in Practice for Python
Introduction
RAG From Scratch in Python built retrieval with nothing but a list of floats and sorted(): cosine similarity computed in a loop, top-k picked with a slice. That post said outright that this is a brute-force O(n) scan — fine for a few thousand chunks, the wrong tool once a corpus reaches millions. This post picks up exactly there: how do you store and search vectors at that scale, using Postgres, and when do you need something else entirely?
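For reference, the brute-force approach being replaced looks roughly like the sketch below. It is a reconstruction under assumptions, not the earlier post's exact code: the names `cosine_similarity` and `top_k` and the `(text, vector)` pair layout are chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length lists of floats."""
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (math.sqrt(norm_a) * math.sqrt(norm_b))


def top_k(query_vector, chunks, k=3):
    """Score every chunk against the query, then keep the k best.

    `chunks` is a list of (text, vector) pairs. Every vector is visited once
    per query, so the cost grows linearly with the size of the corpus.
    """
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Every query touches every stored vector and then sorts all the scores. That is acceptable at a few thousand chunks and is exactly the cost a vector index is meant to avoid.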