Inspect AI
The Eval Framework the UK Government Uses to Test Frontier Models
Built by a government lab with no marketing budget. Used by the people who decide whether a frontier model ships.
The UK AI Security Institute evaluates frontier models before they are released. This is not a hypothetical. It is a statutory responsibility, carried out by a government body that has to produce results that hold up under parliamentary scrutiny and international regulatory pressure. The tool they built to do it is called Inspect AI. It is open-source, MIT-licensed, maintained on a GitHub organization nobody browses, and released at version 0.3.240 yesterday afternoon. It is the most industrial-grade evaluation framework in the open, and hardly anyone in the commercial AI ecosystem has heard of it.
Inspect AI is not a product. It has no pricing page, no enterprise tier, no SOC 2 compliance badge, no sales team that sends you a calendar link. It is a Python framework that runs on your infrastructure, defines evals as code, scores them with reproducible metrics, and produces results you can defend to a regulator six months later. The architecture is solver-and-scorer. A solver is anything that produces an answer: a single model call, a chain of prompts, a ReAct agent with tools, an external coding agent bridged over ACP. A scorer is anything that evaluates that answer: exact match, model-graded QA, a custom Python function that checks whether the agent’s filesystem state matches the expected result. This separation is not novel in the abstract. What is novel is how thoroughly Inspect treats both halves as first-class programmable concerns rather than configuration-driven afterthoughts.
A solver in Inspect is an async function that takes a TaskState (along with a generate function for calling the model) and returns a TaskState. That means solvers compose. You can precede a basic generate() call with a system_message() that sets the prompt, chain it with a chain_of_thought() that makes the model reason before answering, or drop a full react() agent into the solver slot and let it use tools across dozens of turns. Running the same answer through five different grading rubrics is a scorer-side job, handled by multi_scorer(), and it does not require touching the solver at all. The framework does not care whether the solver is a one-line model call or a hundred-step agent trajectory with bash access and web browsing. The contract is the same: produce a state, hand it to the scorer, record the result. This is what “industrial-grade” means in practice. Not features. Composability under load. The ability to swap out the solver without touching the dataset, the scorer, the sandbox configuration, or the reporting pipeline. The same harness that ran a SimpleQA benchmark last month should be able to run a CTF challenge eval this month by changing one import and two lines.
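Why a state-in, state-out contract composes can be shown without the library at all. The sketch below is a library-free model, not Inspect's implementation: State, chain, and the two toy solvers are hypothetical names, and real Inspect solvers are async and work on a much richer state object.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class State:
    """Toy stand-in for a task state: a prompt plus accumulated messages."""
    prompt: str
    messages: tuple[str, ...] = ()


# A solver is anything that maps a state to a new state.
Solver = Callable[[State], State]


def with_system_message(text: str) -> Solver:
    """Return a solver that prepends a system message."""
    def solve(state: State) -> State:
        return replace(state, messages=(f"system: {text}",) + state.messages)
    return solve


def fake_generate() -> Solver:
    """Stand in for a model call by echoing the prompt in upper case."""
    def solve(state: State) -> State:
        reply = f"assistant: {state.prompt.upper()}"
        return replace(state, messages=state.messages + (reply,))
    return solve


def chain(*solvers: Solver) -> Solver:
    """Compose solvers left to right; the result is itself a solver."""
    def solve(state: State) -> State:
        for step in solvers:
            state = step(state)
        return state
    return solve


pipeline = chain(with_system_message("Be brief."), fake_generate())
final = pipeline(State(prompt="hello"))
```

Because chain() returns something with the same shape as its inputs, a pipeline can be nested inside another pipeline, and the code that consumes the final state never needs to know how many steps produced it.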
The scorer side is equally programmable. Inspect ships with built-in scorers for text matching and multiple choice, plus model-graded scorers such as model_graded_fact() and model_graded_qa(). It also lets you define custom scorers as arbitrary Python that receives the final task state (the model’s output and the full sample metadata) together with the target answer. This is where the framework separates from platforms that give you five predefined metric types and call it coverage. If your eval needs to check whether the model’s SQL query returns the correct row count within a 500ms time budget, you write a scorer that does that. If your eval needs to grade the model’s response against a rubric that your compliance team drafted in a Google Doc, you parse the rubric into a prompt template and pass it to model_graded_qa(). The framework provides the scaffolding. The actual measurement is yours.
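As a sketch of the SQL example, here is the core check such a scorer might perform, written as a plain function against an in-memory SQLite database so that it runs without a model. The function name, the Check result type, and the orders table are all hypothetical; wiring this into Inspect would mean wrapping the same logic in a custom scorer, which is not shown here.

```python
import sqlite3
import time
from dataclasses import dataclass


@dataclass
class Check:
    passed: bool
    explanation: str


def check_row_count(
    conn: sqlite3.Connection,
    query: str,
    expected_rows: int,
    budget_ms: float = 500.0,
) -> Check:
    """Run a model-written query and verify row count and wall-clock time."""
    start = time.perf_counter()
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        # Invalid SQL counts as a failed answer, not a crashed eval.
        return Check(False, f"query failed: {exc}")
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    if elapsed_ms > budget_ms:
        return Check(False, f"took {elapsed_ms:.1f} ms, budget {budget_ms} ms")
    if len(rows) != expected_rows:
        return Check(False, f"expected {expected_rows} rows, got {len(rows)}")
    return Check(True, f"{len(rows)} rows in {elapsed_ms:.1f} ms")


# Tiny fixture database standing in for the eval's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("open",), ("closed",)],
)

good = check_row_count(conn, "SELECT * FROM orders WHERE status = 'open'", 2)
bad = check_row_count(conn, "SELECT * FROM orders", 2)
```

Returning an explanation alongside the pass or fail value is a deliberate choice: when a sample fails, the reason is recorded next to the score instead of having to be reconstructed later.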
Sandboxed execution is where Inspect does something that most eval frameworks skip because it is hard. When your eval involves an agent that runs code, opens files, or executes shell commands, that code should not run on the machine that is doing the evaluating. Inspect’s sandbox system isolates every tool call in a Docker container, a Kubernetes pod, or a Modal sandbox, configurable via a Dockerfile or compose.yaml that lives alongside the task definition. The default is Docker, which means you can define a CTF challenge eval that gives the agent a bash tool, a python tool, and a filesystem with a hidden flag, and every invocation of those tools executes inside a container that resets between samples. Tasks that need a database or a mock API server can declare those as additional services in the same compose.yaml, and sandbox_service() lets code running inside the sandbox call back into the Inspect process. This is the kind of infrastructure that takes a team of evaluation engineers months to build from scratch, and Inspect provides it as a configuration parameter.
Dataset versioning is the feature that makes regulatory audit survivable. Inspect reads datasets from Hugging Face, CSV, JSON, or in-memory Python. Every dataset load records a hash of the input data so that you can prove, six months later, that the eval results you are showing a regulator were produced against the same inputs. This sounds like table stakes until you are the person who has to demonstrate it. The versioning is automatic. The hash is in the eval log. The log is a JSON file you can store wherever your compliance policy requires. There is no hosted service that needs to exist in six months for your audit trail to be complete.
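What a content hash buys you in an audit can be illustrated with the standard library alone. This is a generic sketch, not Inspect's hashing scheme or log format; the function name and the sample fields are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(samples: list[dict]) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding."""
    # Sorting keys and fixing separators makes the encoding deterministic,
    # so equal data always hashes to the same value.
    canonical = json.dumps(
        samples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = [
    {"input": "2 + 2 = ?", "target": "4"},
    {"input": "Capital of France?", "target": "Paris"},
]

# Same data, different key order: the fingerprint should not change.
reordered_keys = [
    {"target": "4", "input": "2 + 2 = ?"},
    {"target": "Paris", "input": "Capital of France?"},
]

# One target silently edited: the fingerprint should change.
edited = [
    {"input": "2 + 2 = ?", "target": "5"},
    {"input": "Capital of France?", "target": "Paris"},
]

same = dataset_fingerprint(original) == dataset_fingerprint(reordered_keys)
changed = dataset_fingerprint(original) != dataset_fingerprint(edited)
```

Storing a digest like this next to the results means that re-hashing the archived dataset six months later either reproduces the recorded value or shows that the inputs have drifted.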
The agent support deserves its own paragraph because it is where Inspect has pulled ahead of the eval pack in the last six months. The built-in react() agent runs a standard reason-act-observe loop with tool access and configurable attempt limits. But the genuinely distinctive feature is the agent bridge, which lets Inspect evaluate external agents as though they were native solvers. If you want to evaluate how Claude Code performs on a software engineering benchmark, you configure agent_bridge("claude_code") as your solver and Inspect spawns Claude Code inside a sandbox, feeds it the task prompt, captures every tool call and response, and scores the result against your target. The same bridge works for Codex CLI, Gemini CLI, and any agent that speaks ACP. This means Inspect is not just a framework for evaluating models. It is a framework for evaluating the thing your users actually interact with, which is increasingly not a model but an agent wrapped around one.
The control channel that shipped in yesterday’s release is a window into where this is heading. The inspect eval command now binds a per-process Unix-domain socket that exposes a read surface for the live run. Another process can connect and observe the eval in progress: current sample, score so far, errors encountered. The inspect ctl commands let CLI tools, scripts, and other agents query a running eval without touching the log file. This is the kind of feature that a platform would build as a paid add-on with a real-time dashboard upsell. Inspect exposes it as a local socket because the AISI’s use case is automation, not demonstration. When you are running a thousand evals in parallel across a cluster, you do not want a dashboard. You want a control plane that your orchestration layer can query programmatically.
The numbers on the repo tell a story that the marketing numbers do not. 2,206 stars. 563 forks. 216 open issues, most of which are feature requests or provider integration discussions with active maintainer responses within 48 hours. The changelog ships weekly. The release that landed yesterday adds model fallback tracking, server-side tool classification, Hugging Face Storage Bucket support for eval logs, and a redesigned Inspect View with dark mode and virtualized transcript rendering. The release before that, six days earlier, added Anthropic Claude 5 support and the control channel. The release cadence is faster than most venture-funded developer tools, and the maintainers are a government lab and a public-benefit corporation.
Two thousand stars is not nothing. It is also not the number you would expect for a framework that has evaluated more frontier models in a regulatory context than any commercial platform. The reason for the gap is straightforward. Inspect AI was never marketed as a product because it was never intended to be one. The AISI built it to solve their own evaluation problem, open-sourced it because government-built infrastructure should be public, and continues to maintain it because their own work depends on it. The GitHub org is UKGovernmentBEIS. The logo is the AISI crest. The documentation is thorough, well-structured, and reads like technical documentation rather than a product pitch. The project has no Twitter presence, no Discord community, no conference booth, and no investor deck. It has a PyPI package that updated yesterday and a changelog that reads like it was written by engineers who actually run the evals they are documenting.
The consequence is a framework that is quieter than it should be and more capable than most people assume. If you are building an eval pipeline for agentic AI systems and you are considering a hosted platform, the question to ask is not whether the platform has more features than Inspect. It is whether the platform ships every feature you actually need, and whether the ones it ships will still be around at the same price point in two years. Inspect ships more eval infrastructure out of the box than most teams will ever configure, requires no vendor relationship, and is maintained by an organization whose incentives are aligned with accurate measurement rather than recurring revenue. That combination is rare enough to be worth noticing.
The eval arc continues tomorrow with DeepEval, the pytest-native framework that drops LLM metrics into your CI pipeline. The stack is coming together. Inspect is the foundation.
If this was useful, forward it to one engineer who needs less noise in their feed.


