
HNIR-CCP vs LLMs: What 100 Governance Scenarios Actually Show

2026-03-29
Tags: AI Safety · Control Planes · LLM Evaluation · Governance

Most AI agent systems today make the same architectural assumption:

Route everything through an LLM — even when the task is deterministic.

In the previous post, I introduced HNIR-CCP, a deterministic control plane designed to intercept and handle governance-critical operations before they reach an LLM.

This post answers a simple question:

Does that actually matter in practice?

To test this, I ran a structured evaluation comparing HNIR-CCP against multiple frontier LLMs and guardrail frameworks.

Evaluation setup

The goal was not to benchmark general intelligence, but to isolate governance behavior.

Systems evaluated

- HNIR-CCP (deterministic control plane)
- Frontier LLMs: GPT-4o, o3, Claude Opus, Claude Sonnet, Gemini 2.5 Pro
- Guardrail frameworks: NeMo Guardrails, Guardrails AI

Scenario design

100 structured scenarios across four categories:

| Category | Count |
| --- | --- |
| Adversarial | 30 |
| Control commands | 20 |
| Policy enforcement | 30 |
| State transitions | 20 |

Each scenario was executed 5 times per system to measure consistency.
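The 5-runs-per-scenario protocol can be sketched as a small harness. This is an illustrative sketch, not the actual evaluation code: `run_scenario`, the system list, and the scenario dictionaries are hypothetical names invented for the example.

```python
# Hypothetical harness sketch: run each scenario 5 times per system and
# record per-scenario agreement and correctness. Names are illustrative.
from collections import Counter

RUNS_PER_SCENARIO = 5

def evaluate(systems, scenarios, run_scenario):
    """Run every scenario RUNS_PER_SCENARIO times per system."""
    results = {}
    for system in systems:
        rows = []
        for scenario in scenarios:
            outputs = [run_scenario(system, scenario)
                       for _ in range(RUNS_PER_SCENARIO)]
            # Agreement: fraction of runs producing the modal answer.
            top_answer, top_count = Counter(outputs).most_common(1)[0]
            rows.append({
                "scenario": scenario["id"],
                "agreement": top_count / RUNS_PER_SCENARIO,
                "correct": top_answer == scenario["expected"],
            })
        results[system] = rows
    return results
```

Recording agreement and correctness separately is what makes the "confident-but-wrong" analysis below possible.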

Critical control: Temperature = 0

All LLM evaluations were run with temperature = 0, i.e. greedy decoding. This removes sampling variability, makes outputs repeatable, and rules out "worst-case cherry-picking" of unlucky samples.

The goal was to evaluate capability, not randomness.

Results

1. Governance compliance

| System | Compliance |
| --- | --- |
| HNIR-CCP | 100% |
| Claude Opus | 91% |
| o3 | 87% |
| Claude Sonnet | 81% |
| Gemini 2.5 Pro | 80% |
| GPT-4o | 79% |
| NeMo Guardrails | 48% |
| Guardrails AI | 37% |

No LLM reached full compliance. Even with explicit policy prompting, failures persist.

2. Latency

| System | Latency |
| --- | --- |
| HNIR-CCP | ~0.04 ms (40.6 µs) |
| GPT-4o | ~0.7 s |
| Claude Sonnet | ~1.1 s |
| Claude Opus | ~1.9 s |
| Gemini 2.5 Pro | ~5.7 s |

HNIR-CCP is ~17,000x faster than GPT-4o and ~140,000x faster than Gemini.
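The speedup factors follow directly from the measured latencies in the table above; a two-line check reproduces them:

```python
# Sanity-check the reported speedup factors (all latencies in seconds).
ccp = 40.6e-6    # HNIR-CCP: 40.6 µs
gpt4o = 0.7      # GPT-4o: ~0.7 s
gemini = 5.7     # Gemini 2.5 Pro: ~5.7 s

print(round(gpt4o / ccp))    # 17241  -> reported as ~17,000x
print(round(gemini / ccp))   # 140394 -> reported as ~140,000x
```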

3. Cost

| System | Cost per 100 scenarios |
| --- | --- |
| HNIR-CCP | $0.00 |
| GPT-4o | $0.233 |
| o3 | $0.480 |
| Claude Opus | $1.657 |

LLM governance is not just inconsistent — it's expensive at scale.

4. Consistency and "confident-but-wrong"

A key failure pattern emerged: models often give the same answer repeatedly — and that answer is wrong.

Formally, this means ≥80% agreement across runs with 0% correctness. This pattern is labeled confident-but-wrong.

| System | Confident-but-wrong |
| --- | --- |
| NeMo | 50% |
| Claude Sonnet | 25% |
| GPT-4o | 24% |
| o3 | 20% |
| Claude Opus | 18% |

This is not randomness. It is systematic failure under deterministic decoding.
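The metric as defined above (≥80% agreement, 0% correctness) is easy to compute from per-run outputs. A minimal sketch, assuming outputs are comparable strings; the function name and thresholds are illustrative:

```python
# Classify a scenario as "confident-but-wrong": the modal answer appears
# in >= 80% of runs, yet no run matched the expected answer.
from collections import Counter

def is_confident_but_wrong(outputs, expected, agreement_threshold=0.8):
    top_answer, top_count = Counter(outputs).most_common(1)[0]
    agreement = top_count / len(outputs)
    correctness = sum(o == expected for o in outputs) / len(outputs)
    return agreement >= agreement_threshold and correctness == 0.0
```

Note that a single correct run among five disqualifies a scenario, which is what separates this failure mode from ordinary sampling noise.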

What this actually means

This is the key takeaway:

Governance is not a reasoning problem. It is a control problem.

LLMs are probabilistic, context-sensitive, and non-reproducible under edge conditions. That's fine for summarization, generation, and open-ended reasoning.

But it breaks for access control, state validation, and policy enforcement.

The architectural shift

Instead of:

User -> LLM -> Everything else

HNIR-CCP introduces:

User -> Control Plane -> (LLM only if needed)

Where:

- the control plane deterministically handles access control, state validation, policy enforcement, and other governance-critical operations, and
- the LLM is invoked only for tasks that genuinely require open-ended reasoning or generation.
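The control-plane-first dispatch above can be sketched in a few lines. This is an illustrative sketch, not the HNIR-CCP implementation: the handler shape, policy fields, and decision strings are invented for the example.

```python
# Illustrative control-plane-first dispatch: deterministic checks run
# first; the LLM is reached only when the request falls through them.
def handle_request(request, policy, llm_call):
    # 1. Access control: reject unauthorized actors before any model call.
    if request["actor"] not in policy["allowed_actors"]:
        return {"decision": "deny", "handled_by": "control_plane"}
    # 2. Control commands: handled deterministically, never routed to the LLM.
    if request["action"] in policy["control_commands"]:
        return {"decision": "execute", "handled_by": "control_plane"}
    # 3. Everything else (open-ended work) falls through to the LLM.
    return {"decision": llm_call(request), "handled_by": "llm"}
```

The point of the ordering is that every governance-critical path terminates before the probabilistic component is ever invoked.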

What this is NOT claiming

This is not "LLMs are bad" or "LLMs should not be used."

This is:

LLMs are the wrong abstraction for certain classes of problems.

Full paper

The full evaluation, methodology, and results are available here:

https://zenodo.org/records/19324744

Reference implementation

HNIR-CCP implementation:

https://github.com/Teknamin/hnir-ccp


Source: https://github.com/Teknamin/hnir-ccp
Lab: https://www.teknamin.com
Author: https://www.raviaravind.com