Evaluation on PG Blog

Evaluating LLM Apps in Java

Sat, 04 Jul 2026 00:00:00 +0000

Introduction

Building Reliable LLM Applications in Java put it plainly: treat model output as a hypothesis to verify, not a fact to trust. Testing Best Practices in Java put the same discipline in JUnit terms: a suite only earns trust by asserting the right things at the right level, unhappy paths included. This post is where those two ideas meet. A JUnit test either passes or fails against a fixed expected value, but an LLM’s output is a paragraph of prose that might be right in spirit while differing token-for-token from anything you wrote down in advance. Evaluating it takes a harness, not an assertEquals.

Evaluating LLM Apps in Python

Sat, 04 Jul 2026 00:00:00 +0000

Introduction

Building Reliable LLM Applications in Python put it plainly: treat model output as a hypothesis to verify, not a fact to trust. Testing Best Practices in Python put the same discipline in pytest terms: a suite only earns trust by asserting the right things at the right level, unhappy paths included. This post is where those two ideas meet. A pytest assertion either passes or fails against a fixed expected value, but an LLM’s output is a paragraph of prose that might be right in spirit while differing token-for-token from anything you wrote down in advance. Evaluating it takes a harness, not an assert.
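
To make the contrast with a plain assert concrete, here is a minimal sketch of what a harness could look like. It is not the post's implementation: the names `EvalCase`, `score_output`, `run_eval`, and `fake_model` are hypothetical, and substring matching is only a crude stand-in for whatever grading a real harness would use. The point it illustrates is that each output gets a score against criteria, and the suite passes on an aggregate threshold rather than on string equality.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One evaluation example (hypothetical structure, for illustration)."""
    prompt: str
    required_facts: list[str]                           # phrases a correct answer should mention
    forbidden: list[str] = field(default_factory=list)  # phrases that mark the answer wrong


def score_output(output: str, case: EvalCase) -> float:
    """Return the fraction of required facts present, or 0.0 if a forbidden phrase appears."""
    text = output.lower()
    if any(bad.lower() in text for bad in case.forbidden):
        return 0.0
    if not case.required_facts:
        return 1.0
    hits = sum(fact.lower() in text for fact in case.required_facts)
    return hits / len(case.required_facts)


def run_eval(model: Callable[[str], str], cases: list[EvalCase], threshold: float = 0.8) -> dict:
    """Score every case and pass the suite when the mean score reaches the threshold."""
    scores = [score_output(model(case.prompt), case) for case in cases]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs without network access."""
    return "Paris is the capital of France; it sits on the Seine."


cases = [EvalCase("What is the capital of France?", ["Paris", "France"], ["Lyon"])]
report = run_eval(fake_model, cases)
```

An exact-match assert against "The capital of France is Paris." would fail on this output even though the answer is correct; the scored version accepts it because both required facts appear and no forbidden phrase does.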

Making RAG Accurate in Java

Sat, 04 Jul 2026 00:00:00 +0000

Introduction

RAG From Scratch in Java built a retrieval pipeline out of cosine similarity and a reranking pass, and Vector Databases in Practice for Java moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: does it retrieve the right chunks, and how would you know?

This post answers that question in two halves. First, three techniques that improve what gets retrieved in the first place: hybrid search that catches what pure vector similarity misses, metadata filtering that narrows the search space before ranking even starts, and chunking choices that shape recall long before a query is ever run. Second, the metrics that turn “this feels better” into a number you can track across a change: recall@k, precision@k, MRR, and nDCG. The code below is illustrative and has not been executed; it is consistent with the pipeline built in post 20.
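
For orientation, the four metrics named above are usually defined as follows. This is the common textbook form, not necessarily the exact variant the post uses; the symbols are assumptions introduced here: $R_q$ is the set of relevant chunks for query $q$, $T_k(q)$ is the set of top-$k$ retrieved chunks, $\mathrm{rank}_q$ is the position of the first relevant result, and $\mathrm{rel}_i$ is the graded relevance of the result at position $i$.

```latex
\begin{align*}
\text{recall@}k    &= \frac{|T_k(q) \cap R_q|}{|R_q|} \\[4pt]
\text{precision@}k &= \frac{|T_k(q) \cap R_q|}{k} \\[4pt]
\text{MRR}         &= \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
                      \quad \text{(a query with no relevant result contributes } 0\text{)} \\[4pt]
\text{DCG@}k       &= \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)},
\qquad
\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}
\end{align*}
```

Here IDCG@k is the DCG@k of the ideal ordering, with the most relevant chunks first. As a small worked case, suppose a query has two relevant chunks and the top three results are relevant, irrelevant, relevant. Then recall@3 is 2/2 = 1, precision@3 is 2/3, the reciprocal rank is 1, and with binary relevance DCG@3 = 1 + 0 + 1/2 = 1.5 against an ideal of 1 + 1/log2(3) ≈ 1.63, giving nDCG@3 ≈ 0.92.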

Making RAG Accurate in Python

Sat, 04 Jul 2026 00:00:00 +0000

Introduction

RAG From Scratch in Python built a retrieval pipeline out of cosine similarity and a reranking pass, and Vector Databases in Practice for Python moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: does it retrieve the right chunks, and how would you know?