<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Evaluation on PG Blog</title><link>https://pg-blogs.netlify.app/tags/evaluation/</link><description>Recent content in Evaluation on PG Blog</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 04 Jul 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://pg-blogs.netlify.app/tags/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>Evaluating LLM Apps in Java</title><link>https://pg-blogs.netlify.app/posts/30-evaluating-llm-apps-in-java/</link><pubDate>Sat, 04 Jul 2026 00:00:00 +0000</pubDate><guid>https://pg-blogs.netlify.app/posts/30-evaluating-llm-apps-in-java/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://pg-blogs.netlify.app/posts/11-building-reliable-llm-apps-in-java/"&gt;Building Reliable LLM Applications in Java&lt;/a&gt; put it plainly: &lt;strong&gt;treat model output as a hypothesis to verify, not a fact to trust.&lt;/strong&gt; &lt;a href="https://pg-blogs.netlify.app/posts/16-testing-best-practices-in-java/"&gt;Testing Best Practices in Java&lt;/a&gt; put the same discipline in JUnit terms: a suite only earns trust by asserting the right things at the right level, unhappy paths included. This post is where those two ideas meet — a JUnit test either passes or fails against a fixed expected value; an LLM&amp;rsquo;s output is a paragraph of prose that might be &lt;em&gt;right in spirit&lt;/em&gt; while differing token-for-token from anything you wrote down in advance. Evaluating it takes a harness, not an &lt;code&gt;assertEquals&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Evaluating LLM Apps in Python</title><link>https://pg-blogs.netlify.app/posts/31-evaluating-llm-apps-in-python/</link><pubDate>Sat, 04 Jul 2026 00:00:00 +0000</pubDate><guid>https://pg-blogs.netlify.app/posts/31-evaluating-llm-apps-in-python/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://pg-blogs.netlify.app/posts/10-building-reliable-llm-apps-in-python/"&gt;Building Reliable LLM Applications in Python&lt;/a&gt; put it plainly: &lt;strong&gt;treat model output as a hypothesis to verify, not a fact to trust.&lt;/strong&gt; &lt;a href="https://pg-blogs.netlify.app/posts/17-testing-best-practices-in-python/"&gt;Testing Best Practices in Python&lt;/a&gt; put the same discipline in pytest terms: a suite only earns trust by asserting the right things at the right level, unhappy paths included. This post is where those two ideas meet — a pytest assertion either passes or fails against a fixed expected value; an LLM&amp;rsquo;s output is a paragraph of prose that might be &lt;em&gt;right in spirit&lt;/em&gt; while differing token-for-token from anything you wrote down in advance. Evaluating it takes a harness, not an &lt;code&gt;assert&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Making RAG Accurate in Java</title><link>https://pg-blogs.netlify.app/posts/24-making-rag-accurate-in-java/</link><pubDate>Sat, 04 Jul 2026 00:00:00 +0000</pubDate><guid>https://pg-blogs.netlify.app/posts/24-making-rag-accurate-in-java/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://pg-blogs.netlify.app/posts/20-rag-from-scratch-in-java/"&gt;RAG From Scratch in Java&lt;/a&gt; built a retrieval pipeline out of cosine similarity and a reranking pass, and &lt;a href="https://pg-blogs.netlify.app/posts/22-vector-databases-in-practice-for-java/"&gt;Vector Databases in Practice for Java&lt;/a&gt; moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: &lt;strong&gt;does it retrieve the right chunks, and how would you know?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This post answers that question in two halves. First, three techniques that improve what gets retrieved in the first place: hybrid search that catches what pure vector similarity misses, metadata filtering that narrows the search space before ranking even starts, and chunking choices that shape recall long before a query is ever run. Second, the metrics that turn &amp;ldquo;this feels better&amp;rdquo; into a number you can track across a change: recall@k, precision@k, MRR, and nDCG. The code below is illustrative and has not been executed; it is consistent with the pipeline built in post 20.&lt;/p&gt;</description></item><item><title>Making RAG Accurate in Python</title><link>https://pg-blogs.netlify.app/posts/25-making-rag-accurate-in-python/</link><pubDate>Sat, 04 Jul 2026 00:00:00 +0000</pubDate><guid>https://pg-blogs.netlify.app/posts/25-making-rag-accurate-in-python/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://pg-blogs.netlify.app/posts/21-rag-from-scratch-in-python/"&gt;RAG From Scratch in Python&lt;/a&gt; built a retrieval pipeline out of cosine similarity and a reranking pass, and &lt;a href="https://pg-blogs.netlify.app/posts/23-vector-databases-in-practice-for-python/"&gt;Vector Databases in Practice for Python&lt;/a&gt; moved that same index into pgvector so it can hold millions of chunks. Neither post asked the question that actually decides whether a RAG system is any good: &lt;strong&gt;does it retrieve the right chunks, and how would you know?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This post answers that question in two halves. First, three techniques that improve what gets retrieved in the first place: hybrid search that catches what pure vector similarity misses, metadata filtering that narrows the search space before ranking even starts, and chunking choices that shape recall long before a query is ever run. Second, the metrics that turn &amp;ldquo;this feels better&amp;rdquo; into a number you can track across a change: recall@k, precision@k, MRR, and nDCG. The code below is illustrative and has not been executed; it is consistent with the pipeline built in post 21.&lt;/p&gt;</description></item></channel></rss>