The AI-Decision Evidence Benchmark

Your AI logs something. Could it survive an examiner?

AEB is an open, format-neutral benchmark that scores any AI decision record against the five questions a regulator, an enterprise security review, or opposing counsel actually asks. It is runnable and deterministic, and the scorer is open, because an evidence benchmark you had to take on our word would fail its own hardest test.

"If someone demanded proof of what your AI decided, and would not accept ‘the model said so’ — what could you hand them?"

The leaderboard

Decision-record format	Attribution	Policy	Approval	Integrity	Independence	Score
GateFrame Decision Provenance Record	2	2	2	2	2	10/10
AWS CloudTrail event	2	0	1	1	1	5/10
OpenTelemetry span (GenAI semconv)	2	0	0	0	0	2/10
Plain application log (JSON)	1	0	0	0	0	1/10

// No operator-controlled format earns full marks on dimension 5: CloudTrail scores 1 there, and the OpenTelemetry span and the plain application log score 0. CloudTrail scores best among them because it has genuine cryptographic log-file validation, but that integrity is attested by the cloud provider and the very account under audit. It cannot answer "verify without trusting the operator." That is the dimension an adversarial examination turns on.

What it measures

Attribution

Which model/agent version decided — pinned, not inferred?

Policy enforcement

What policy or scope applied, captured with the decision?

Approval

Who or what approved it — is that in the record?

Integrity

Can you prove the record wasn’t altered afterward?

05 · THE GAP

Independent verifiability

Can a third party verify all of the above without trusting the operator?

// 0 = absent · 1 = present but operator-controlled / partial · 2 = present and independently verifiable. The rubric is format-neutral: any record that scores 2 on all five dimensions totals 10/10. GateFrame is not privileged by the rubric; it is built to it.
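The arithmetic behind the rubric can be sketched in a few lines. This is a minimal illustration, not the published scorer: the names `DIMENSIONS`, `LEVELS`, `total_score`, and `leaderboard_totals` are hypothetical, and the per-dimension levels are copied from the leaderboard table rather than derived from real records.

```python
# Illustrative sketch only: NOT the published AEB scorer.
# All names here are hypothetical; levels are copied from the leaderboard.

DIMENSIONS = ("attribution", "policy", "approval", "integrity", "independence")

LEVELS = {
    0: "absent",
    1: "present but operator-controlled / partial",
    2: "present and independently verifiable",
}


def total_score(levels: dict) -> int:
    """Sum the five per-dimension levels (each 0, 1, or 2) into a score out of 10."""
    missing = set(DIMENSIONS) - set(levels)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if levels[name] not in LEVELS:
            raise ValueError(f"{name}: level must be 0, 1, or 2")
    return sum(levels[name] for name in DIMENSIONS)


# Per-dimension levels as listed in the leaderboard, in DIMENSIONS order.
LEADERBOARD = {
    "GateFrame Decision Provenance Record": (2, 2, 2, 2, 2),
    "AWS CloudTrail event": (2, 0, 1, 1, 1),
    "OpenTelemetry span (GenAI semconv)": (2, 0, 0, 0, 0),
    "Plain application log (JSON)": (1, 0, 0, 0, 0),
}


def leaderboard_totals() -> dict:
    """Recompute each format's total from its per-dimension levels."""
    return {
        name: total_score(dict(zip(DIMENSIONS, row)))
        for name, row in LEADERBOARD.items()
    }


if __name__ == "__main__":
    for name, total in leaderboard_totals().items():
        print(f"{name}: {total}/10")
```

Because the total is a plain sum with no weighting, a format's position depends only on the five levels. Deciding which level a given record earns on each dimension is the job of the published scorer and is not modeled here.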

Score your own logs

Run it yourself

The harness, the rubric, and the deterministic scorer are open. Point it at your real decision logs and read exactly why each dimension scored what it did.

# scores the reference formats
python benchmark.py

# scores YOUR decision log
python benchmark.py --file your_log.json

Why open?

An evidence benchmark you had to take on faith would fail dimension 5 itself. Every score here is reproducible from the published scorer. Disagree with a number — read the code and tell us where it’s wrong. That is the standard we hold our own records to.

The finding

Across every operator-held logging approach in common use, the record can establish what happened to a degree — but not in a form a third party can verify without trusting the party being examined. Independent verifiability is the dimension that decides an adversarial review, and it is the one no operator-controlled log satisfies. A Decision Provenance Record closes it: a signature any examiner checks against a published key, with no involvement from the operator and none from GateFrame.
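To make "checks against a published key" concrete, here is a minimal sketch of that kind of verification. It rests on assumptions this page does not state: Ed25519 as the signature scheme, sorted-key JSON as the canonical encoding, and the record fields shown are all hypothetical, and it relies on the third-party `cryptography` package. It is not the Decision Provenance Record format itself.

```python
# Illustrative sketch only. Assumptions (not from the source): Ed25519
# signatures, sorted-key JSON canonicalization, and hypothetical field names.
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def canonical_bytes(record: dict) -> bytes:
    """Serialize the record deterministically so signer and verifier agree."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_record(record: dict, signature: bytes, public_key_bytes: bytes) -> bool:
    """Return True if the signature matches the record under the published key.

    Needs only the record, the signature, and the public key: no call to the
    operator or to the signer is involved.
    """
    public_key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    try:
        public_key.verify(signature, canonical_bytes(record))
    except InvalidSignature:
        return False
    return True


def demo() -> tuple:
    """Sign a hypothetical record, then verify it untouched and after tampering."""
    signer = Ed25519PrivateKey.generate()  # stands in for the signing party
    published_key = signer.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    record = {  # hypothetical fields, for illustration only
        "model_version": "example-model-1",
        "policy_id": "example-policy",
        "approved_by": "example-approver",
        "decision": "approve",
    }
    signature = signer.sign(canonical_bytes(record))

    untouched_ok = verify_record(record, signature, published_key)
    tampered = dict(record, approved_by="someone-else")
    tampered_ok = verify_record(tampered, signature, published_key)
    return untouched_ok, tampered_ok


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the shape of `verify_record`: its inputs are all things an examiner can hold independently, so a changed field makes verification fail without anyone having to trust the party that produced the record.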

See your real number.

We’ll run AEB against a sample of your production AI-decision logs and walk you through exactly where the evidence gap is — and what closing dimension 5 takes. A focused pilot, not a sales call.

Request a benchmark pilot · Read the source on GitHub · View a live signed record