Content Generation

How the generate-judge-improve eval loop works: setting up generation prompts, gold standards, judge.md, running evals via POST /eval/run, and tracking budget and provenance.

The eval loop generates content against the knowledge graph, scores it against a judge, and iterates until a quality threshold is met or a maximum iteration count is reached. No content ships until it passes.

The loop is implemented in @sourcepress/ai as EvalRunner. Budget is tracked per-run and per-day via BudgetTracker.


How the Loop Works

generate → judge → [score >= threshold] → done
                 ↘ [score < threshold]  → improve prompt → generate → judge → ...

Each iteration:

  1. Generate — calls the generate AI function with the current prompt and the knowledge graph context.
  2. Judge — calls the judge AI function, which scores the output against the gold standard defined in judge.md.
  3. Evaluate — if the score meets evals.threshold, the loop exits and the content is submitted for approval. If not, the improve prompt function rewrites the generation prompt and the loop repeats.
  4. Cap — if max_iterations is reached before the threshold is met, the loop exits with the best result so far and records the final score.
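
The steps above can be sketched as a small control loop. The generate, judge, and improvePrompt functions here are hypothetical stand-ins for the real AI calls in @sourcepress/ai; this is a sketch of the flow, not the actual EvalRunner implementation.

```typescript
type JudgeResult = { score: number; critique: string };

// Sketch of the generate-judge-improve loop: generate, score, exit on
// threshold, otherwise rewrite the prompt and retry up to maxIterations.
async function runEvalLoop(
  prompt: string,
  threshold: number,
  maxIterations: number,
  generate: (prompt: string) => Promise<string>,
  judge: (content: string) => Promise<JudgeResult>,
  improvePrompt: (prompt: string, critique: string) => Promise<string>,
) {
  let best = { content: "", score: -1, iterations: 0 };
  for (let i = 1; i <= maxIterations; i++) {
    const content = await generate(prompt);
    const { score, critique } = await judge(content);
    if (score > best.score) best = { content, score, iterations: i };
    if (score >= threshold) break; // threshold met: done
    prompt = await improvePrompt(prompt, critique); // rewrite prompt, retry
  }
  return best; // best result so far, even if the cap was hit
}
```

Note that the best-scoring result is kept across iterations, which is what makes the max_iterations exit return something useful.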

File Structure

A typical eval setup lives alongside your content collection:

evals/
  blog-post/
    prompt.md       # generation prompt
    gold.md         # gold standard (example of ideal output)
    judge.md        # scoring rubric for the judge model

prompt.md

The generation prompt. Written in plain text or Markdown. Reference knowledge graph entities using {{entity.slug}} interpolation — the runner resolves these before sending to the model.

Write a 600-word blog post about {{entity.vector-databases}} for an audience of
backend engineers. Cover use cases, tradeoffs, and when not to use one.
Tone: direct, no marketing language.
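
The interpolation step can be sketched as a single template pass. The entities map here is a hypothetical stand-in for the knowledge graph lookup the runner performs, and failing fast on an unknown slug is an assumption, not documented runner behavior.

```typescript
// Sketch of {{entity.slug}} interpolation against a resolved entity map.
function interpolate(prompt: string, entities: Record<string, string>): string {
  return prompt.replace(/\{\{entity\.([a-z0-9-]+)\}\}/g, (_match, slug) => {
    const resolved = entities[slug];
    if (resolved === undefined) {
      // Assumption: unresolved slugs are an error rather than passed through.
      throw new Error(`Unknown entity: ${slug}`);
    }
    return resolved;
  });
}
```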

gold.md

An example of ideal output. The judge uses this as a reference when scoring. It does not need to be the exact target — it establishes the quality bar.

# When to Use a Vector Database (And When Not To)

Vector databases store embeddings and retrieve by similarity...
[full example output]

judge.md

A scoring rubric. The judge model reads this file and uses it to produce a numeric score between 0 and 1, plus a structured critique.

Score the generated content on the following criteria. Return a score from 0.0 to 1.0
and a brief critique for each dimension.

- Accuracy (0–0.4): All claims are supported by the knowledge graph. No hallucinations.
- Tone (0–0.3): Direct, no filler, no marketing language.
- Coverage (0–0.3): Covers use cases, tradeoffs, and when not to use the technology.

Final score = sum of dimension scores.

Running an Eval

POST /eval/run

Triggers the generate-judge-improve loop for a given eval configuration.

Request body:

{
  "collection": "blog-post",
  "slug": "vector-databases",
  "eval_dir": "evals/blog-post",
  "threshold": 0.85,
  "max_iterations": 5
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| collection | string | Yes | Target content collection. |
| slug | string | Yes | Slug for the output file. Must match /^[a-z0-9-]+$/. |
| eval_dir | string | Yes | Path to the directory containing prompt.md, gold.md, and judge.md. |
| threshold | number | No | Score (0–1) required to exit the loop. Defaults to evals.threshold in config. |
| max_iterations | number | No | Maximum loop iterations before exiting with the best result. |
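
A request can be issued with fetch. The base URL here is an assumption, and any auth headers depend on your deployment; building the request separately keeps it inspectable without a running server.

```typescript
// Sketch of calling POST /eval/run. BASE_URL is an assumed local address.
const BASE_URL = "http://localhost:3000";

interface EvalRunBody {
  collection: string;
  slug: string;
  eval_dir: string;
  threshold?: number;
  max_iterations?: number;
}

// Build the request separately so it can be inspected or tested offline.
function buildEvalRequest(body: EvalRunBody) {
  return {
    url: `${BASE_URL}/eval/run`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

async function runEval(body: EvalRunBody) {
  const { url, init } = buildEvalRequest(body);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`eval/run failed: ${res.status}`);
  return res.json();
}
```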

Response (success):

{
  "status": "passed",
  "score": 0.91,
  "iterations": 3,
  "content": "...",
  "provenance": {
    "knowledge_files": ["knowledge/vector-databases.md", "knowledge/embeddings.md"],
    "model": "claude-opus-4-5",
    "judge_score": 0.91,
    "iterations": 3,
    "eval_dir": "evals/blog-post",
    "generated_at": "2026-04-05T14:22:10Z"
  },
  "budget": {
    "tokens_used": 18420,
    "daily_remaining": 481580
  }
}

Response (threshold not met):

{
  "status": "below_threshold",
  "score": 0.78,
  "iterations": 5,
  "content": "...",
  "provenance": { ... },
  "budget": { ... }
}

A below_threshold response still returns the best content produced. Whether to submit it for approval is your decision.
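
Callers can branch on the status field. The response shape below mirrors the examples above; routing below-threshold output to human review is one reasonable policy, not prescribed behavior.

```typescript
// Minimal response shape for branching on /eval/run results.
type EvalResponse = {
  status: "passed" | "below_threshold";
  score: number;
  iterations: number;
  content: string;
};

// Passed runs go to approval; below-threshold runs are surfaced for a
// human decision, since the best content is still returned.
function decide(res: EvalResponse): "submit" | "review" {
  return res.status === "passed" ? "submit" : "review";
}
```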


Configuration

Set defaults in your SourcePress config:

import { defineConfig } from "@sourcepress/core";

export default defineConfig({
  evals: {
    threshold: 0.85,
    max_iterations: 5,
    model: "claude-opus-4-5",
    judge_model: "claude-opus-4-5",
  },
  budget: {
    daily_token_limit: 500000,
  },
});

| Key | Type | Description |
| --- | --- | --- |
| evals.threshold | number | Score (0–1) required to pass. Applied to all runs unless overridden per-request. |
| evals.max_iterations | number | Loop cap. Prevents runaway spend on a single generation. |
| evals.model | string | Model used for generation and prompt improvement. |
| evals.judge_model | string | Model used for judging. Can differ from the generation model. |
| budget.daily_token_limit | number | BudgetTracker enforces this limit across all AI calls in a calendar day. |

Provenance Metadata

Every file that exits the eval loop carries a provenance block. This is written into the output file before it is submitted as a GitHub PR.

# _provenance.yaml (embedded in output frontmatter or sidecar)
knowledge_files:
  - knowledge/vector-databases.md
  - knowledge/embeddings.md
model: claude-opus-4-5
judge_score: 0.91
iterations: 3
eval_dir: evals/blog-post
generated_at: 2026-04-05T14:22:10Z
approved_by: null   # populated after PR merge

Provenance records which knowledge files informed the content, which model generated it, what score the judge assigned, and how many iterations were required. After a PR is merged via GitHubPRApprovalProvider, approved_by is populated with the approving identity.
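
In TypeScript, the provenance block shown above can be modeled as a plain type. The isApproved helper is illustrative, not part of the package API.

```typescript
// Type mirroring the provenance block. approved_by stays null until the
// PR is merged via GitHubPRApprovalProvider.
interface Provenance {
  knowledge_files: string[];
  model: string;
  judge_score: number;
  iterations: number;
  eval_dir: string;
  generated_at: string; // ISO 8601 timestamp
  approved_by: string | null;
}

// A run counts as approved only once an identity has been recorded.
function isApproved(p: Provenance): boolean {
  return p.approved_by !== null;
}
```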

This structure records the traceability metadata relevant to EU AI Act provenance requirements — source files, model, score, iteration count, and approver — without additional instrumentation.


Budget Tracking

BudgetTracker in @sourcepress/ai counts tokens across all AI functions — generate, judge, improve prompt, and any knowledge graph calls made during the run. All AI functions call recordUsage() from packages/ai/src/functions/usage.ts to register consumption.

If a run would exceed budget.daily_token_limit, the loop exits before the next iteration and returns the best result accumulated so far. The response includes budget.daily_remaining so callers can decide whether to retry later.

Track spend across runs by inspecting the budget field in each /eval/run response. No separate endpoint is required.
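
Tallying spend from those budget fields is straightforward; the field names below follow the response example above, and the canAfford check is an illustrative client-side heuristic.

```typescript
// Tally token spend across /eval/run responses via their budget fields.
type Budget = { tokens_used: number; daily_remaining: number };

function totalTokensUsed(budgets: Budget[]): number {
  return budgets.reduce((sum, b) => sum + b.tokens_used, 0);
}

// daily_remaining already reflects the day's limit minus recorded usage,
// so the latest response is enough to gate the next run.
function canAfford(latest: Budget, estimatedNextRun: number): boolean {
  return latest.daily_remaining >= estimatedNextRun;
}
```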


CLI

Run evals from the terminal using the eval command:

pnpm sourcepress eval --collection blog-post --slug vector-databases

The CLI reads evals.threshold and evals.max_iterations from config. Override per-run:

pnpm sourcepress eval --collection blog-post --slug vector-databases --threshold 0.9 --max-iterations 3

Output reports score, iteration count, and token usage. No content is written to disk until the run passes or you explicitly pass --accept-below-threshold.