eval-kit
Calibrated LLM-as-judge
A small Python library: calibrated LLM-as-judge, synthetic adversarial examples with provenance, a regression diff that fails the PR. There’s a GitHub Action that wires it together. Ragas, Inspect AI, and promptfoo plug in via adapters.
κ ≥ 0.7 calibrated judge · 161 tests passing · 87% coverage · MIT licence
Install
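No install command survives on this page, so here is a reasonable sketch: the PyPI name `eval-kit` is an assumption matching the repo name, and the git URL is derived from the Action's `uses:` line below.

```sh
pip install eval-kit  # assumed PyPI name, matching the repo
# or pin to the tag the Action uses:
pip install "git+https://github.com/AmirD10224/eval-kit@v0.1.0"
```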
Usage
```python
from evalkit.judges import CalibratedJudge
from evalkit.runner import Suite

# 1. Load a YAML rubric, calibrate it against your golden set.
judge = CalibratedJudge.from_rubric("rubrics/faithfulness.yaml")
report = judge.calibrate("golden.jsonl")
assert report.cohen_kappa >= 0.7, "judge disagrees with humans"

# 2. Run a suite over your app's outputs in parallel, get a JSON report.
suite = Suite.from_yaml("suites/main.yaml")
result = suite.run(predictions="predictions.jsonl")
result.save("current.json")

# 3. In CI, diff against main and fail the PR on regressions.
# (Or just use the GitHub Action, see below.)
```
Calibrate the judge
A YAML rubric and your golden set in. A reliability score (Cohen's κ) out. The judge refuses to deploy if it disagrees with humans.
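For reference, the reliability score is plain Cohen's κ between the judge's verdicts and human labels on the golden set. A minimal sketch of the same check using scikit-learn; the label lists here are illustrative, not eval-kit's API:

```python
from sklearn.metrics import cohen_kappa_score

# Verdicts on the same golden examples, one label per example.
human_labels = ["pass", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "pass", "fail"]

# κ corrects raw agreement for agreement expected by chance:
# κ = (p_observed - p_chance) / (1 - p_chance)
kappa = cohen_kappa_score(human_labels, judge_labels)
assert kappa >= 0.7, f"judge disagrees with humans (κ={kappa:.2f})"
```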
Run the suite
A parallel runner over your app's predictions. Strict Pydantic validation at every boundary. JSON report with provenance; retries and parallelism built in.
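A sketch of the shape that description implies, assuming Pydantic v2; every name here (`PredictionRow`, `score`) is illustrative, not eval-kit's internals:

```python
import json
from concurrent.futures import ThreadPoolExecutor

from pydantic import BaseModel


class PredictionRow(BaseModel):
    # Strict validation at the boundary: malformed rows fail loudly, early.
    id: str
    question: str
    prediction: str


def score(row: PredictionRow) -> dict:
    # Stand-in for a metric call (judge, exact match, ...).
    return {"id": row.id, "faithfulness": 1.0 if row.prediction else 0.0}


with open("predictions.jsonl") as f:
    rows = [PredictionRow.model_validate_json(line) for line in f]

# Fan the scoring out over a thread pool, collect a JSON-able report.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score, rows))

with open("current.json", "w") as f:
    json.dump(results, f, indent=2)
```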
Diff in CI
Current vs baseline, with configurable per-metric thresholds. A regression in any one metric fails the check automatically.
What’s in it
evalkit.judges
YAML rubric → calibrated judge with bias auditing for position, length, and self-preference. Refuses to deploy when Cohen's κ < 0.7.
Reliability vs humans · κ = 0.81
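How a position-bias audit like that works in principle: score each pair twice with the candidates swapped and count verdict flips; a judge that rewards position rather than content flips often. A minimal, library-agnostic sketch (the `judge` callable is a stand-in, not eval-kit's API):

```python
def position_flip_rate(judge, pairs) -> float:
    """judge(x, y) -> "a" or "b" (which slot won); pairs is a list of (answer_1, answer_2)."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)
        second = judge(b, a)  # same pair, order swapped
        # Consistent verdicts pick the same underlying answer:
        # if answer_1 won slot "a" first, it should win slot "b" after the swap.
        if (first == "a") != (second == "b"):
            flips += 1
    return flips / len(pairs)

# A flip rate well above zero means the judge rewards position, not content.
```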
evalkit.synth
Taxonomy-based generation, edge cases, jailbreaks, multi-turn, PII probes, distribution shift, with provenance tracking.
Generated in last suite · 240
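What provenance tracking typically means here: each generated example carries the metadata needed to reproduce and audit it. An illustrative record shape, not eval-kit's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:
    text: str
    taxonomy: str         # e.g. "jailbreak/multi-turn" or "pii-probe"
    generator: str        # model + version that produced it
    seed_example_id: str  # golden-set row it was derived from, if any
    prompt_hash: str      # hash of the generation prompt, for reproducibility
```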
evalkit diff
Diff vs baseline. Configurable per-metric threshold. Sticky PR comment posts the delta. Failures block merge.
Faithfulness · 30 days · -13.4pp (baseline 0.847 → current 0.713)
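The gate reduces to per-metric deltas between two JSON reports and a nonzero exit when any drop exceeds its threshold. A sketch of that logic under an assumed `{"metric": score}` report shape; the real entry point is `evalkit diff`:

```python
import json
import sys

THRESHOLD_PP = 5.0  # matches fail-on-regression: 5 in the Action config

with open("baseline.json") as f:
    baseline = json.load(f)  # e.g. {"faithfulness": 0.847, ...}
with open("current.json") as f:
    current = json.load(f)

failed = False
for metric, base in sorted(baseline.items()):
    delta_pp = (current[metric] - base) * 100
    regressed = -delta_pp > THRESHOLD_PP
    failed = failed or regressed
    mark = "✗" if regressed else "✓"
    print(f"{mark} {metric} {base:.3f} → {current[metric]:.3f} ({delta_pp:+.1f}pp)")

sys.exit(1 if failed else 0)  # a nonzero exit is what blocks the merge
```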
.github/workflows/eval.yml
One YAML step, three inputs. Posts a sticky regression report to your PR and blocks merge until the model is fixed or thresholds are adjusted.
```yaml
- uses: AmirD10224/eval-kit@v0.1.0
  with:
    suite: suites/main.yaml
    baseline: main
    fail-on-regression: 5
```

≈ 30s overhead · cached calibration
In CI
github-actions[bot] commented:
✗ Eval gate · regression detected
✓ answer_relevancy 0.812 → 0.835 (+2.3pp)
✓ closed_book 0.733 → 0.733 (+0.0pp)
✗ faithfulness 0.847 → 0.713 (-13.4pp) ← exceeds 5pp threshold
✓ idk_when_no_ctx 0.967 → 0.967 (+0.0pp)
✓ context_precision 0.795 → 0.811 (+1.6pp)

faithfulness regressed by 13.4pp. 3 examples of the regression below; full report in the artifact.
Q: "What does X return when Y is null?"
Baseline: cited docs/api.md L42; Current: hallucinated "returns empty list"
eval-kit v0.1.0 · suite suites/main.yaml · 50 rows · κ=0.81
Comparison
| Feature | EvalKit | Hand-rolled | Ragas | promptfoo |
|---|---|---|---|---|
| Calibrated LLM-judge (κ ≥ 0.7) | ✓ | - | - | - |
| Position / length / self-pref bias tests | ✓ | - | - | - |
| Multi-rater agreement (κ, Fleiss') | ✓ | - | - | - |
| Synthetic adversarial data (taxonomy) | ✓ | partial | - | partial |
| Regression detection in CI | ✓ | partial | - | partial |
| GitHub Action that blocks the PR | ✓ | - | - | - |
| Works with Ragas + Inspect + promptfoo | ✓ | - | partial | partial |
Get it
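Source: github.com/AmirD10224/eval-kit (the repo the Action references) · v0.1.0 · MIT. Install as shown above.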