eval-kit
Calibrated LLM-as-judge
A small Python library: calibrated LLM-as-judge, synthetic adversarial examples with provenance, a regression diff that fails the PR. There’s a GitHub Action that wires it together. Ragas, Inspect AI, and promptfoo plug in via adapters.
κ ≥ 0.7 calibrated judge · 161 tests passing · 87% coverage · MIT licence
Install
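No install command survives on this page, so here is a reasonable sketch: the PyPI name `eval-kit` is an assumption matching the repo name, and the git URL is derived from the Action's `uses:` line below.

```sh
pip install eval-kit  # assumed PyPI name, matching the repo
# or pin to the tag the Action uses:
pip install "git+https://github.com/AmirD10224/eval-kit@v0.1.0"
```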
Usage
```python
from evalkit.judges import CalibratedJudge
from evalkit.runner import Suite

# 1. Load a YAML rubric, calibrate it against your golden set.
judge = CalibratedJudge.from_rubric("rubrics/faithfulness.yaml")
report = judge.calibrate("golden.jsonl")
assert report.cohen_kappa >= 0.7, "judge disagrees with humans"

# 2. Run a suite over your app's outputs in parallel, get a JSON report.
suite = Suite.from_yaml("suites/main.yaml")
result = suite.run(predictions="predictions.jsonl")
result.save("current.json")

# 3. In CI, diff against main and fail the PR on regressions.
# (Or just use the GitHub Action, see below.)
```
Calibrate the judge
A YAML rubric and your golden set in. A reliability score (Cohen's κ) out. The judge refuses to deploy if it disagrees with humans.
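For reference, the reliability score is plain Cohen's κ between the judge's verdicts and human labels on the golden set. A minimal sketch of the same check using scikit-learn; the label lists here are illustrative, not eval-kit's API:

```python
from sklearn.metrics import cohen_kappa_score

# Verdicts on the same golden examples, one label per example.
human_labels = ["pass", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "pass", "fail"]

# κ corrects raw agreement for agreement expected by chance:
# κ = (p_observed - p_chance) / (1 - p_chance)
kappa = cohen_kappa_score(human_labels, judge_labels)
assert kappa >= 0.7, f"judge disagrees with humans (κ={kappa:.2f})"
```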
Run the suite
A parallel runner over your app's predictions. Strict Pydantic validation at every boundary. JSON report with provenance; retries and parallelism built in.
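A sketch of the shape that description implies, assuming Pydantic v2; every name here (`PredictionRow`, `score`) is illustrative, not eval-kit's internals:

```python
import json
from concurrent.futures import ThreadPoolExecutor

from pydantic import BaseModel


class PredictionRow(BaseModel):
    # Strict validation at the boundary: malformed rows fail loudly, early.
    id: str
    question: str
    prediction: str


def score(row: PredictionRow) -> dict:
    # Stand-in for a metric call (judge, exact match, ...).
    return {"id": row.id, "faithfulness": 1.0 if row.prediction else 0.0}


with open("predictions.jsonl") as f:
    rows = [PredictionRow.model_validate_json(line) for line in f]

# Fan the scoring out over a thread pool, collect a JSON-able report.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score, rows))

with open("current.json", "w") as f:
    json.dump(results, f, indent=2)
```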
Diff in CI
Current vs baseline, with configurable per-metric thresholds. A regression in any one metric fails the check automatically.
What’s in it
evalkit.judges
YAML rubric → calibrated judge with bias auditing for position, length, and self-preference. Refuses to deploy when Cohen's κ < 0.7.
Reliability vs humans · κ = 0.81
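How a position-bias audit like that works in principle: score each pair twice with the candidates swapped and count verdict flips; a judge that rewards position rather than content flips often. A minimal, library-agnostic sketch (the `judge` callable is a stand-in, not eval-kit's API):

```python
def position_flip_rate(judge, pairs) -> float:
    """judge(x, y) -> "a" or "b" (which slot won); pairs is a list of (answer_1, answer_2)."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)
        second = judge(b, a)  # same pair, order swapped
        # Consistent verdicts pick the same underlying answer:
        # if answer_1 won slot "a" first, it should win slot "b" after the swap.
        if (first == "a") != (second == "b"):
            flips += 1
    return flips / len(pairs)

# A flip rate well above zero means the judge rewards position, not content.
```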
evalkit.synth
Taxonomy-based generation, edge cases, jailbreaks, multi-turn, PII probes, distribution shift, with provenance tracking.
Generated in last suite · 240
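What provenance tracking typically means here: each generated example carries the metadata needed to reproduce and audit it. An illustrative record shape, not eval-kit's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:
    text: str
    taxonomy: str         # e.g. "jailbreak/multi-turn" or "pii-probe"
    generator: str        # model + version that produced it
    seed_example_id: str  # golden-set row it was derived from, if any
    prompt_hash: str      # hash of the generation prompt, for reproducibility
```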
evalkit diff
Diff vs baseline. Configurable per-metric threshold. Sticky PR comment posts the delta. Failures block merge.
Faithfulness · 30 days · -13.4pp (baseline 0.847 → current 0.713)
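The gate reduces to per-metric deltas between two JSON reports and a nonzero exit when any drop exceeds its threshold. A sketch of that logic under an assumed `{"metric": score}` report shape; the real entry point is `evalkit diff`:

```python
import json
import sys

THRESHOLD_PP = 5.0  # matches fail-on-regression: 5 in the Action config

with open("baseline.json") as f:
    baseline = json.load(f)  # e.g. {"faithfulness": 0.847, ...}
with open("current.json") as f:
    current = json.load(f)

failed = False
for metric, base in sorted(baseline.items()):
    delta_pp = (current[metric] - base) * 100
    regressed = -delta_pp > THRESHOLD_PP
    failed = failed or regressed
    mark = "✗" if regressed else "✓"
    print(f"{mark} {metric} {base:.3f} → {current[metric]:.3f} ({delta_pp:+.1f}pp)")

sys.exit(1 if failed else 0)  # a nonzero exit is what blocks the merge
```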
.github/workflows/eval.yml
One YAML step, three inputs. Posts a sticky regression report to your PR and blocks merge until the model is fixed or thresholds are adjusted.
```yaml
- uses: AmirD10224/eval-kit@v0.1.0
  with:
    suite: suites/main.yaml
    baseline: main
    fail-on-regression: 5
```

≈ 30s overhead · cached calibration
In CI
github-actions[bot] commented:
✗ Eval gate · regression detected
✓ answer_relevancy 0.812 → 0.835 (+2.3pp)
✓ closed_book 0.733 → 0.733 (+0.0pp)
✗ faithfulness 0.847 → 0.713 (-13.4pp) ← exceeds 5pp threshold
✓ idk_when_no_ctx 0.967 → 0.967 (+0.0pp)
✓ context_precision 0.795 → 0.811 (+1.6pp)

faithfulness regressed by 13.4pp. 3 examples of the regression below; full report in the artifact.
Q: "What does X return when Y is null?"
Baseline: cited docs/api.md L42; Current: hallucinated "returns empty list"
eval-kit v0.1.0 · suite suites/main.yaml · 50 rows · κ=0.81
Comparison
| Feature | EvalKit | Hand-rolled | Ragas | promptfoo |
|---|---|---|---|---|
| Calibrated LLM-judge (κ ≥ 0.7) | ✓ | - | - | - |
| Position / length / self-pref bias tests | ✓ | - | - | - |
| Multi-rater agreement (κ, Fleiss') | ✓ | - | - | - |
| Synthetic adversarial data (taxonomy) | ✓ | partial | - | partial |
| Regression detection in CI | ✓ | partial | - | partial |
| GitHub Action that blocks the PR | ✓ | - | - | - |
| Works with Ragas + Inspect + promptfoo | ✓ | - | partial | partial |
Get it
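Source: github.com/AmirD10224/eval-kit (the repo the Action references) · v0.1.0 · MIT. Install as shown above.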