# Content Generation

How the generate-judge-improve eval loop works: setting up generation prompts, gold standards, `judge.md`, running evals via `POST /eval/run`, and tracking budget and provenance.
The eval loop generates content against the knowledge graph, scores it against a judge, and iterates until a quality threshold is met or a maximum iteration count is reached. No content ships until it passes.
The loop is implemented in `@sourcepress/ai` as `EvalRunner`. Budget is tracked per-run and per-day via `BudgetTracker`.
## How the Loop Works
```
generate → judge → [score >= threshold] → done
                 ↘ [score < threshold] → improve prompt → generate → judge → ...
```
Each iteration:

- **Generate** — calls the `generateAI` function with the current prompt and the knowledge graph context.
- **Judge** — calls the `judgeAI` function, which scores the output against the gold standard defined in `judge.md`.
- **Evaluate** — if the score meets `evals.threshold`, the loop exits and the content is submitted for approval. If not, the improve-prompt function rewrites the generation prompt and the loop repeats.
- **Cap** — if `max_iterations` is reached before the threshold is met, the loop exits with the best result so far and records the final score.
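The steps above can be sketched as a small control loop. This is an illustrative outline, not the actual `EvalRunner` implementation — the function names `generate`, `judge`, and `improvePrompt` are passed in as assumptions standing in for the real `@sourcepress/ai` functions:

```typescript
type JudgeResult = { score: number; critique: string };

// Hypothetical sketch of the generate-judge-improve loop. The callback
// shapes are assumptions; only the control flow mirrors the docs above.
async function runEvalLoop(
  initialPrompt: string,
  threshold: number,
  maxIterations: number,
  generate: (prompt: string) => Promise<string>,
  judge: (content: string) => Promise<JudgeResult>,
  improvePrompt: (prompt: string, critique: string) => Promise<string>,
) {
  let prompt = initialPrompt;
  let best = { content: "", score: -1, iterations: 0 };

  for (let i = 1; i <= maxIterations; i++) {
    const content = await generate(prompt);            // Generate
    const { score, critique } = await judge(content);  // Judge
    if (score > best.score) best = { content, score, iterations: i };
    if (score >= threshold) {
      return { status: "passed" as const, ...best };   // Evaluate: threshold met
    }
    prompt = await improvePrompt(prompt, critique);    // Improve and repeat
  }
  // Cap reached: exit with the best result accumulated so far.
  return { status: "below_threshold" as const, ...best };
}
```

Note that the best-scoring iteration is retained even when a later iteration regresses, which matches the documented behavior of returning "the best result so far" at the cap.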
## File Structure
A typical eval setup lives alongside your content collection:
```
evals/
  blog-post/
    prompt.md   # generation prompt
    gold.md     # gold standard (example of ideal output)
    judge.md    # scoring rubric for the judge model
```
### prompt.md
The generation prompt. Written in plain text or Markdown. Reference knowledge graph entities using `{{entity.slug}}` interpolation — the runner resolves these before sending to the model.
```
Write a 600-word blog post about {{entity.vector-databases}} for an audience of
backend engineers. Cover use cases, tradeoffs, and when not to use one.
Tone: direct, no marketing language.
```
### gold.md
An example of ideal output. The judge uses this as a reference when scoring. It does not need to be the exact target — it establishes the quality bar.
```md
# When to Use a Vector Database (And When Not To)

Vector databases store embeddings and retrieve by similarity...

[full example output]
```
### judge.md
A scoring rubric. The judge model reads this file and uses it to produce a numeric score between 0 and 1, plus a structured critique.
```
Score the generated content on the following criteria. Return a score from 0.0 to 1.0
and a brief critique for each dimension.

- Accuracy (0–0.4): All claims are supported by the knowledge graph. No hallucinations.
- Tone (0–0.3): Direct, no filler, no marketing language.
- Coverage (0–0.3): Covers use cases, tradeoffs, and when not to use the technology.

Final score = sum of dimension scores.
```
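The rubric's "sum of dimension scores" arithmetic can be made concrete with a small helper. This is purely illustrative — the dimension names and weights come from the example rubric above, and nothing here is part of the SourcePress API:

```typescript
// Dimension weights assumed from the example rubric above.
const weights = { accuracy: 0.4, tone: 0.3, coverage: 0.3 };

// Combine per-dimension judge scores into the final 0–1 score.
// Each dimension is clamped to [0, weight] so the sum stays within [0, 1].
function finalScore(dims: { accuracy: number; tone: number; coverage: number }): number {
  return (Object.keys(weights) as (keyof typeof weights)[])
    .map((k) => Math.min(Math.max(dims[k], 0), weights[k]))
    .reduce((a, b) => a + b, 0);
}
```

Because the weights sum to 1.0, a perfect score on every dimension yields exactly 1.0, and the per-dimension caps prevent a generous judge from pushing the total above the documented 0–1 range.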
## Running an Eval
### `POST /eval/run`
Triggers the generate-judge-improve loop for a given eval configuration.
Request body:
```json
{
  "collection": "blog-post",
  "slug": "vector-databases",
  "eval_dir": "evals/blog-post",
  "threshold": 0.85,
  "max_iterations": 5
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `collection` | string | Yes | Target content collection. |
| `slug` | string | Yes | Slug for the output file. Must match `/^[a-z0-9-]+$/`. |
| `eval_dir` | string | Yes | Path to the directory containing `prompt.md`, `gold.md`, and `judge.md`. |
| `threshold` | number | No | Score (0–1) required to exit the loop. Defaults to `evals.threshold` in config. |
| `max_iterations` | number | No | Maximum loop iterations before exiting with the best result. |
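A client call might look like the following. This is a hypothetical helper, not part of SourcePress: the base URL is whatever host serves the API, and only the endpoint path, field names, and the slug constraint come from the documentation above.

```typescript
type EvalRunRequest = {
  collection: string;
  slug: string;
  eval_dir: string;
  threshold?: number;
  max_iterations?: number;
};

// Build the request for POST /eval/run, enforcing the documented
// slug constraint client-side before hitting the API.
function buildEvalRunRequest(req: EvalRunRequest): {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
} {
  if (!/^[a-z0-9-]+$/.test(req.slug)) {
    throw new Error(`invalid slug: ${req.slug}`);
  }
  return {
    url: "/eval/run",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(req),
    },
  };
}
```

Usage would then be something like `const { url, init } = buildEvalRunRequest({...}); const res = await fetch(url, init);`, with the base URL prepended as appropriate for your deployment.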
Response (success):
```json
{
  "status": "passed",
  "score": 0.91,
  "iterations": 3,
  "content": "...",
  "provenance": {
    "knowledge_files": ["knowledge/vector-databases.md", "knowledge/embeddings.md"],
    "model": "claude-opus-4-5",
    "judge_score": 0.91,
    "iterations": 3,
    "eval_dir": "evals/blog-post",
    "generated_at": "2026-04-05T14:22:10Z"
  },
  "budget": {
    "tokens_used": 18420,
    "daily_remaining": 481580
  }
}
```
Response (threshold not met):
```json
{
  "status": "below_threshold",
  "score": 0.78,
  "iterations": 5,
  "content": "...",
  "provenance": { ... },
  "budget": { ... }
}
```
A `below_threshold` response still returns the best content produced. Whether to submit it for approval is your decision.
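One way to encode that decision is a small policy function. This is a sketch of one possible caller-side policy, not SourcePress behavior — the salvage cutoff is an arbitrary assumption:

```typescript
type EvalRunResponse = {
  status: "passed" | "below_threshold";
  score: number;
  iterations: number;
  content: string;
};

// Example caller policy: submit passing runs, salvage near-misses
// above an arbitrary cutoff, and retry everything else later.
function decide(res: EvalRunResponse, salvageCutoff = 0.8): "submit" | "retry" {
  if (res.status === "passed") return "submit";
  return res.score >= salvageCutoff ? "submit" : "retry";
}
```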
## Configuration
Set defaults in your SourcePress config:
```ts
import { defineConfig } from "@sourcepress/core";

export default defineConfig({
  evals: {
    threshold: 0.85,
    max_iterations: 5,
    model: "claude-opus-4-5",
    judge_model: "claude-opus-4-5",
  },
  budget: {
    daily_token_limit: 500000,
  },
});
```
| Key | Type | Description |
|---|---|---|
| `evals.threshold` | number | Score (0–1) required to pass. Applied to all runs unless overridden per-request. |
| `evals.max_iterations` | number | Loop cap. Prevents runaway spend on a single generation. |
| `evals.model` | string | Model used for generation and prompt improvement. |
| `evals.judge_model` | string | Model used for judging. Can differ from the generation model. |
| `budget.daily_token_limit` | number | `BudgetTracker` enforces this limit across all AI calls in a calendar day. |
## Provenance Metadata
Every file that exits the eval loop carries a provenance block. This is written into the output file before it is submitted as a GitHub PR.
```yaml
# _provenance.yaml (embedded in output frontmatter or sidecar)
knowledge_files:
  - knowledge/vector-databases.md
  - knowledge/embeddings.md
model: claude-opus-4-5
judge_score: 0.91
iterations: 3
eval_dir: evals/blog-post
generated_at: 2026-04-05T14:22:10Z
approved_by: null # populated after PR merge
```
Provenance records which knowledge files informed the content, which model generated it, what score the judge assigned, and how many iterations were required. After a PR is merged via `GitHubPRApprovalProvider`, `approved_by` is populated with the approving identity.

This structure is designed to satisfy EU AI Act traceability requirements without additional instrumentation.
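Embedding the provenance block as frontmatter can be sketched as follows. The field names match the example above, but the serializer here is hand-rolled for illustration — it is not SourcePress's actual writer, which may use a YAML library or a sidecar file instead:

```typescript
type Provenance = {
  knowledge_files: string[];
  model: string;
  judge_score: number;
  iterations: number;
  eval_dir: string;
  generated_at: string;
  approved_by: string | null;
};

// Prepend a YAML frontmatter block carrying the provenance fields
// to the generated content before it is submitted as a PR.
function withProvenance(content: string, p: Provenance): string {
  const files = p.knowledge_files.map((f) => `  - ${f}`).join("\n");
  return [
    "---",
    "knowledge_files:",
    files,
    `model: ${p.model}`,
    `judge_score: ${p.judge_score}`,
    `iterations: ${p.iterations}`,
    `eval_dir: ${p.eval_dir}`,
    `generated_at: ${p.generated_at}`,
    `approved_by: ${p.approved_by ?? "null"}`,
    "---",
    "",
    content,
  ].join("\n");
}
```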
## Budget Tracking
`BudgetTracker` in `@sourcepress/ai` counts tokens across all AI functions — generate, judge, improve prompt, and any knowledge graph calls made during the run. All AI functions call `recordUsage()` from `packages/ai/src/functions/usage.ts` to register consumption.

If a run would exceed `budget.daily_token_limit`, the loop exits before the next iteration and returns the best result accumulated so far. The response includes `budget.daily_remaining` so callers can decide whether to retry later.

Track spend across runs by inspecting the `budget` field in each `/eval/run` response. No separate endpoint is required.
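A minimal sketch of this accounting, assuming a simple counter behind `recordUsage()` — the class name matches `BudgetTracker` but the internals are illustrative, not the real `@sourcepress/ai` implementation:

```typescript
// Illustrative daily token budget tracker. Persistence and the
// calendar-day reset are omitted for brevity.
class BudgetTracker {
  private used = 0;

  constructor(private dailyLimit: number) {}

  // Called by every AI function to register consumption.
  recordUsage(tokens: number): void {
    this.used += tokens;
  }

  // Reported back to callers as budget.daily_remaining.
  remaining(): number {
    return Math.max(this.dailyLimit - this.used, 0);
  }

  // The eval loop checks this before starting another iteration.
  canAfford(estimatedTokens: number): boolean {
    return this.used + estimatedTokens <= this.dailyLimit;
  }
}
```

A pre-iteration `canAfford` check of this shape is what lets the loop exit cleanly with its best result instead of failing mid-generation when the limit is hit.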
## CLI
Run evals from the terminal using the `eval` command:

```sh
pnpm sourcepress eval --collection blog-post --slug vector-databases
```
The CLI reads `evals.threshold` and `evals.max_iterations` from config. Override per-run:

```sh
pnpm sourcepress eval --collection blog-post --slug vector-databases --threshold 0.9 --max-iterations 3
```
Output reports score, iteration count, and token usage. No content is written to disk until the run passes or you explicitly pass `--accept-below-threshold`.