Posts

RAG From Scratch in Java

Introduction

Retrieval-augmented generation (RAG) is the pattern behind almost every “chat with your documents” feature: instead of hoping a model already knows the answer, you find the passages most likely to contain it and hand them to the model as context. Done well, it’s the single highest-leverage technique for keeping an LLM application grounded in facts it didn’t memorize.

This post builds RAG from scratch — no vector database, no framework — so the mechanics are visible end to end: splitting documents into chunks, turning them into embeddings, searching them with nothing more than array math, reranking the results, and generating a grounded answer. It builds directly on the grounding discipline from Building Reliable LLM Applications in Java — “give the model the source material and instruct it to answer only from that material” — by showing where that source material actually comes from. We’ll close with the question every RAG design eventually has to answer honestly: when does retrieval beat simply pasting more into the context window?

RAG From Scratch in Python

Introduction

This post builds RAG from scratch — no vector database, no framework — so the mechanics are visible end to end: splitting documents into chunks, turning them into embeddings, searching them with nothing more than array math, reranking the results, and generating a grounded answer. It builds directly on the grounding discipline from Building Reliable LLM Applications in Python — “give the model the source material and instruct it to answer only from that material” — by showing where that source material actually comes from. We’ll close with the question every RAG design eventually has to answer honestly: when does retrieval beat simply pasting more into the context window?