Prompt Caching and Cost Control in Java
Introduction
We already covered picking the right model tier for the task and caching a large shared prefix in https://pg-blogs.netlify.app/posts/11-building-reliable-llm-apps-in-java/. Those two tips were the start of a bigger discipline: LLM cost is not a fixed line item but an engineering variable, one you can measure and shrink with the same rigor you’d apply to database query time or container memory.
This post goes deeper: how input/output pricing actually works, the exact cache_control shape and how to prove a cache hit rather than assume one, the Batches API for work that isn’t latency-sensitive, and model routing — using a cheap model to triage, escalating only the hard cases to a stronger one. The honest framing throughout: measure before you optimize. Every technique here has a cost of its own; applied to the wrong workload, “optimization” makes things slower or more expensive.
Prompt Caching and Cost Control in Python
Introduction
An earlier post, https://pg-blogs.netlify.app/posts/10-building-reliable-llm-apps-in-python/, closed with a section on picking the right model per task and caching a shared prefix. That was the entry point into a bigger discipline: LLM spend is an engineering variable, not a fixed bill, one you can measure and reduce with the same rigor you’d apply to query latency or memory footprint.
This post goes deeper on four levers: how input/output pricing actually works and why the prefix is usually where the money goes, the exact cache_control shape and how to prove a cache hit instead of assuming one, the Batches API for work that isn’t latency-sensitive, and model routing — a cheap model triaging requests and escalating only the hard ones. The throughline is honest: measure before you optimize. Every lever here has its own cost; misapplied, it makes things slower or pricier, not cheaper.
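To make "measure before you optimize" concrete, here is a minimal sketch of per-request cost accounting. Everything numeric in it is an assumption for illustration: the per-million-token prices and the cache write/read multipliers are placeholders, not quoted rates, and the `Usage` class is a hypothetical stand-in whose field names mirror the token counters a caching-aware API typically reports in its response. Swap in your provider's real prices and the counters from your actual responses.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (assumptions, not quoted rates).
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00
# Assumed multipliers on the base input price for cached prefix tokens.
CACHE_WRITE_MULTIPLIER = 1.25  # writing the prefix into the cache costs extra
CACHE_READ_MULTIPLIER = 0.10   # reading it back is heavily discounted


@dataclass
class Usage:
    """Hypothetical stand-in for the token counters on an API response."""
    input_tokens: int = 0                 # uncached input tokens
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0  # prefix tokens written to the cache
    cache_read_input_tokens: int = 0      # prefix tokens served from the cache


def request_cost(u: Usage) -> float:
    """Return the cost of one request in USD under the assumed prices."""
    per_input_token = INPUT_PER_MTOK / 1_000_000
    per_output_token = OUTPUT_PER_MTOK / 1_000_000
    return (
        u.input_tokens * per_input_token
        + u.cache_creation_input_tokens * per_input_token * CACHE_WRITE_MULTIPLIER
        + u.cache_read_input_tokens * per_input_token * CACHE_READ_MULTIPLIER
        + u.output_tokens * per_output_token
    )


def is_cache_hit(u: Usage) -> bool:
    """A hit is proven by the counters, not assumed from the request shape."""
    return u.cache_read_input_tokens > 0


# Same 10,000-token shared prefix, 200 tokens of per-request input, 300 of output.
uncached = Usage(input_tokens=10_200, output_tokens=300)
first_call = Usage(input_tokens=200, output_tokens=300,
                   cache_creation_input_tokens=10_000)
second_call = Usage(input_tokens=200, output_tokens=300,
                    cache_read_input_tokens=10_000)

for name, usage in [("uncached", uncached), ("first", first_call),
                    ("second", second_call)]:
    print(f"{name}: ${request_cost(usage):.4f} hit={is_cache_hit(usage)}")
```

Under these assumed numbers the first cached call costs more than the uncached one, because the cache write carries a premium, and the saving only appears from the second call onward. That is the sense in which a lever misapplied to the wrong workload (a prefix that is rarely reused) makes things pricier rather than cheaper.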