The Eval Tools Everyone Talks About Are Not the Tools Frontier Labs Actually Use
The public eval conversation is loud and wrong. The tools that matter are quieter than you think.
The public discourse on LLM evaluation has a shape. It is dominated by a handful of hosted SaaS platforms with good marketing and generous free tiers. Their names show up in every “top eval tools” listicle. Their CEOs get quoted in the trade press. Their sales decks land in CTO inboxes promising enterprise-grade evaluation infrastructure with dashboards and alerts and SOC 2 compliance and a per-trace pricing model that looks cheap in the demo and gets expensive the moment your agent starts running a hundred steps on a Tuesday afternoon. This is not a critique of any specific platform. Some of them are good. Some of them solve real problems for teams that do not have the capacity to run their own eval infrastructure. But the shape of the discourse is strange, because the labs that ship the frontier models you are evaluating do not use any of them.
What the frontier labs use is quieter, weirder, and almost entirely open source. It is built by governments and research institutes and small teams of evaluation specialists who spent years thinking about measurement before anyone offered them a dashboard. It lives in GitHub organizations nobody browses and PyPI packages nobody retweets. It has no marketing budget because it was never a product. It was built to answer a question: does this model actually work, and how would we know if it stopped? Answering that question, it turns out, does not require a SaaS platform. It requires a framework that can run thousands of evals in parallel, score them with reproducible metrics, and produce a result that holds up under scrutiny six months later when a regulator or a customer asks you to prove it.
The stack breaks into layers, and each layer has a tool the public conversation mostly ignores.
The bottom layer is the harness: the thing that runs the evals, manages the concurrency, collects the results, and does not corrupt them in the process. Every frontier lab runs some variant of EleutherAI’s lm-evaluation-harness, a library that has been the de facto standard for running standardized benchmarks since before most people had heard of a transformer. It supports hundreds of tasks across dozens of benchmark suites. It handles model loading across HuggingFace, vLLM, and OpenAI-compatible APIs. It manages few-shot formatting, log-probability extraction, and result aggregation with a level of rigor that comes from being the tool academics use to reproduce each other’s numbers. When you read a model release blog post and see a table of benchmark scores, the harness that produced that table was almost certainly lm-eval. It is the closest thing the field has to a reference implementation, and it is not mentioned in any SaaS vendor’s comparison page.
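To make the harness's job concrete, here is a minimal sketch of the pattern described above: run many items concurrently, keep outputs aligned with their inputs, and aggregate one metric. It is not lm-evaluation-harness code. `EvalItem`, `run_harness`, `toy_model`, and the canned answers are hypothetical names invented for illustration, and a real harness does far more (few-shot formatting, log-probabilities, caching).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalItem:
    prompt: str
    target: str


def run_harness(model: Callable[[str], str], items: List[EvalItem], workers: int = 4) -> Dict:
    """Run every prompt concurrently and aggregate exact-match accuracy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so outputs stay aligned
        # with their items even when calls finish out of order.
        outputs = list(pool.map(model, [item.prompt for item in items]))
    records = [
        {"prompt": item.prompt, "target": item.target, "output": output,
         "correct": output.strip() == item.target}
        for item, output in zip(items, outputs)
    ]
    accuracy = sum(r["correct"] for r in records) / len(records) if records else 0.0
    return {"accuracy": accuracy, "records": records}


# Hypothetical stand-in for a model call; the last answer is deliberately wrong.
CANNED = {"2+2": "4", "3+5": "8", "10+1": "12"}


def toy_model(prompt: str) -> str:
    return CANNED.get(prompt, "")


items = [EvalItem("2+2", "4"), EvalItem("3+5", "8"), EvalItem("10+1", "11")]
report = run_harness(toy_model, items)
print(f"accuracy: {report['accuracy']:.2f}")
```

The per-item records matter as much as the headline number: they are what lets someone else check, item by item, how the score was produced.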
Above the harness sits the task framework: the thing that lets you define custom evals that match what your model actually does in production. The public conversation about evals tends to treat MMLU and GSM8K and HumanEval as though they are sufficient. They are not, and nobody at a frontier lab thinks they are. The benchmarks that matter are the ones you write yourself: the multi-turn agent task where the model has to use six tools in sequence without getting confused, the retrieval task where the corpus is your actual documentation and the questions are pulled from your actual support tickets, the safety eval where the model has to refuse a specific class of harmful request without refusing everything adjacent to it. This is where Inspect AI enters, and it is the tool that reveals the gap between the public eval conversation and what practitioners actually need. Built by the UK AI Security Institute for evaluating frontier models before they ship, Inspect AI provides a solver-and-scorer architecture where evals are Python functions that can use tools, spawn sub-agents, query external APIs, and run in sandboxed containers. It has dataset versioning so your eval results from six months ago are reproducible against the same inputs. It has first-class support for agentic tasks, meaning the model under test can use tools, make plans, and execute multi-step trajectories while the eval framework captures every intermediate state. It runs anywhere, requires no hosted service, and is maintained by a government lab that has no incentive to sell you anything. It is the most industrial-grade eval framework in the open, and the public eval conversation almost never mentions it.
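The solver-and-scorer idea is easier to see in code than in prose. The sketch below is a plain-Python illustration of the pattern, not Inspect AI's actual API: `Sample`, `TaskState`, `use_tool`, `run_task`, and the two toy tools are hypothetical names. A chain of solver steps transforms a state, every step is logged to a transcript, and a scorer judges the final state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Sample:
    input: str
    target: str


@dataclass
class TaskState:
    sample: Sample
    output: str = ""
    # Every intermediate step is recorded so a run can be inspected later.
    transcript: List[Tuple[str, str]] = field(default_factory=list)


Solver = Callable[[TaskState], TaskState]
Scorer = Callable[[TaskState], float]


def use_tool(name: str, tool: Callable[[str], str]) -> Solver:
    """Build a solver step that calls a tool and logs the call."""
    def solve(state: TaskState) -> TaskState:
        # Feed the previous step's output forward, or the raw input on step one.
        result = tool(state.output or state.sample.input)
        state.transcript.append((name, result))
        state.output = result
        return state
    return solve


def exact_match(state: TaskState) -> float:
    return 1.0 if state.output == state.sample.target else 0.0


def run_task(dataset: List[Sample], solvers: List[Solver], scorer: Scorer):
    """Run each sample through the solver chain, then score the final state."""
    results = []
    for sample in dataset:
        state = TaskState(sample)
        for solve in solvers:
            state = solve(state)
        results.append({"score": scorer(state), "transcript": state.transcript})
    return results


def add(expr: str) -> str:
    # Toy "calculator" tool: sums integers separated by "+".
    return str(sum(int(part) for part in expr.split("+")))


def as_sentence(value: str) -> str:
    # Toy "formatter" tool.
    return f"The answer is {value}."


dataset = [Sample("2+3", "The answer is 5."), Sample("4+4", "The answer is 9.")]
results = run_task(
    dataset,
    [use_tool("calculator", add), use_tool("formatter", as_sentence)],
    exact_match,
)
print([r["score"] for r in results])
```

Keeping the solver chain separate from the scorer is the design choice worth noticing: the same scorer can judge different strategies, and the transcript shows how each one arrived at its answer.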
Above the task framework sits the metric layer. Scoring an LLM output is harder than it looks. String matching fails on legitimate paraphrases. LLM-as-judge can be gamed by the same model it is evaluating. Reference-free metrics can miss failures that a human would catch immediately. The metric layer is where DeepEval lives, a pytest-native framework that drops into existing CI pipelines and provides 40-plus built-in metrics covering hallucination, bias, toxicity, faithfulness, answer relevance, and contextual recall. The interface is what makes it different: assert_test(test_case, [metric]) runs like any other unit test, which means eval failures show up in CI the same way a failing integration test does. The maintainers ship day-zero support for every major model release. It is the most engineer-native eval tool in the ecosystem, and the reason the public conversation undersells it is simple: it is not a platform. It is a library. There is no dashboard to screenshot for a Twitter thread.
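A small sketch shows why string matching is not enough, and what a metric wired up as a test assertion looks like. This is plain Python under stated assumptions, not DeepEval's implementation: `token_f1`, `assert_score`, and the example sentences are illustrative, and token overlap is a deliberately simple stand-in for the richer metrics a real library ships.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    # Lowercase and keep alphanumeric runs, so punctuation and case do not matter.
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(output: str, reference: str) -> float:
    return 1.0 if output == reference else 0.0


def token_f1(output: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    out, ref = Counter(tokens(output)), Counter(tokens(reference))
    overlap = sum((out & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def assert_score(name: str, score: float, threshold: float) -> None:
    # Raising AssertionError is what makes a test runner mark the case as failed.
    if score < threshold:
        raise AssertionError(f"{name}={score:.2f} is below threshold {threshold:.2f}")


reference = "Paris is the capital of France."
paraphrase = "The capital of France is Paris"
wrong = "Lyon is the capital of France."

print(exact_match(paraphrase, reference))          # 0.0: a correct paraphrase fails
print(token_f1(paraphrase, reference))             # 1.0: same tokens, different order
print(round(token_f1(wrong, reference), 2))        # 0.83: a wrong answer still scores high
```

The last line is the cautionary part: a factually wrong answer keeps five of six tokens and scores about 0.83, so a threshold such as `assert_score("token_f1", score, 0.9)` catches it only by a narrow margin. Cheap overlap metrics fix one failure mode of string matching and introduce another.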
Above the metric layer sits the observability layer, and this is where the platform vendors are actually competing with each other. But the frontier labs run OpenLLMetry, an OpenTelemetry-native instrumentation SDK that emits standard OTel spans for every LLM call, every vector DB query, every agent tool invocation. One Traceloop.init() call and your agent’s traces appear next to your HTTP and database spans in whatever observability stack you already run. The platform vendors built their own tracing ecosystems because they wanted to own the data. The labs that already run Datadog or Grafana or Honeycomb do not need another tracing ecosystem. They need their AI traces to show up in the one they already have.
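For readers who have not worked with tracing, the sketch below shows what a span is and how an LLM call ends up nested under the request that triggered it. It is a toy stand-in written with the standard library, not the OpenTelemetry or OpenLLMetry API: the `Tracer` class, the span names, and the attribute keys are all hypothetical, and it is not thread-safe.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records nested spans with a parent, attributes, and a duration."""

    def __init__(self):
        self.finished = []  # completed spans, in the order they finished
        self._stack = []    # spans that are currently open

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "name": name,
            # The innermost open span, if any, becomes the parent.
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attributes": dict(attributes),
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration_s"] = time.perf_counter() - record["start"]
            self.finished.append(record)


tracer = Tracer()
with tracer.span("POST /answer"):
    with tracer.span("vector_db.query", top_k=3):
        pass  # a retrieval call would go here
    with tracer.span("llm.chat", model="example-model") as llm_span:
        # Attributes can be attached once the call returns.
        llm_span["attributes"]["output_tokens"] = 42

for s in tracer.finished:
    print(s["name"], "<-", s["parent"])
```

Because the LLM and retrieval spans carry the HTTP request as their parent, they can sit in the same trace view as everything else the request touched, which is the property the paragraph above is describing.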
The top layer is agent-specific evaluation, and this is the newest and most unsettled part of the stack. Most eval tools were built to score text outputs. They were not built to evaluate an agent that opens a browser, navigates a web application, fills out a form, and checks the result. HUD is one of the few tools built for this. It provides environment-based benchmarks where the agent runs inside a real containerized environment and its actions are scored against the resulting state, not the text it emits. It integrates with Inspect AI as a runner, which means you can define an eval in Inspect and execute it inside a HUD environment with a containerized browser or terminal session. This is genuinely novel infrastructure, and the reason it is not on the front page of every eval roundup is that it is less than a year old and maintained by a small team that does not market.
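Scoring the resulting state rather than the emitted text can be sketched in a few lines. The example below is a hypothetical toy, not HUD's API: `FormEnv`, `run_episode`, and `score_state` are invented names, and a dictionary stands in for what would be a real containerized browser or terminal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FormEnv:
    """Toy environment: a form with named fields and a submit button."""
    fields: Dict[str, str] = field(default_factory=dict)
    submitted: bool = False

    def step(self, action: Tuple[str, ...]) -> None:
        kind = action[0]
        if kind == "fill":
            _, name, value = action
            self.fields[name] = value
        elif kind == "submit":
            self.submitted = True
        else:
            raise ValueError(f"unknown action: {kind}")


def run_episode(actions: List[Tuple[str, ...]]) -> FormEnv:
    env = FormEnv()
    for action in actions:
        env.step(action)
    return env


def score_state(env: FormEnv, expected_fields: Dict[str, str]) -> float:
    # Only the environment's final state counts; the agent's own report is ignored.
    return 1.0 if env.submitted and env.fields == expected_fields else 0.0


expected = {"email": "dev@example.com"}

# Agent A fills the field and submits.
completed = run_episode([("fill", "email", "dev@example.com"), ("submit",)])

# Agent B fills the field, then reports "done" without ever submitting.
claimed_only = run_episode([("fill", "email", "dev@example.com")])

print(score_state(completed, expected), score_state(claimed_only, expected))
```

A text-based scorer would pass both agents if both said the form was submitted. The state-based scorer passes only the one that actually submitted it.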
The pattern across all five layers is the same: the tools exist, they are open-source, they are actively maintained with releases within the last month, and the public eval discourse barely knows about them. The discourse is shaped by what has a marketing budget, and none of these tools has one. EleutherAI, a research collective, maintains lm-eval. Inspect AI is maintained by a government lab. DeepEval is maintained by a small startup that builds its paid product on top of the open-source core but does not push the core into platform territory. OpenLLMetry is maintained by Traceloop, a company that would rather you use its open-source instrumentation than give your trace data to a competitor. HUD is maintained by a team that is still figuring out what the product is. None of these organizations has ad spend for the keyword “LLM evaluation” on Google. None of them sends out a weekly newsletter with a subject line that starts with the fire emoji. None of them pays for the analyst report placement that gets them into the CTO’s enterprise evaluation spreadsheet.
The result is a market that is upside down. The tools with the most marketing are the ones adopted by teams that are still figuring out what they need. The tools with the least marketing are the ones that the teams shipping the most advanced models actually use. If you are building an eval pipeline today, you have two paths. Path one is the one the public conversation recommends: pick a hosted platform, onboard your data, configure your metrics, pay per trace, and accept that your eval infrastructure is a cost center with vendor lock-in. Path two is the one the frontier labs follow: start with lm-eval for standardized benchmarks, add Inspect AI for custom task evals, layer in DeepEval for CI-native metric scoring, wire OpenLLMetry into your existing observability stack for tracing, and evaluate HUD if your use case involves agents interacting with environments. Path two is more work up front. It requires you to run infrastructure. It requires you to understand your eval needs well enough to configure a framework rather than click through a dashboard. But it gives you something Path one does not: evals you own, results you can reproduce, and a stack that does not get more expensive every time your agent takes an extra step.
The eval arc we are kicking off this week is a walk through Path two. Over the next six days, we will go deep on the tools in it. Inspect AI on Tuesday: the solver-and-scorer architecture, the sandboxed tool execution, the dataset versioning that makes a regulatory audit survivable. DeepEval on Wednesday: the pytest-native interface, the 40-plus metrics, and what it looks like to have eval failures blocking a CI pipeline. Comet Opik on Thursday: an open-source platform, outside the five layers above, for teams that want dashboards without vendor lock-in. HUD on Friday: the agent-specific eval infrastructure that the rest of the ecosystem is still catching up to. OpenLLMetry on Saturday: the observability layer that makes your agent visible to the monitoring stack you already pay for. Each tool does something specific, and together they form a stack that has evaluated more frontier models than any hosted platform.
The eval conversation has been captured by the narrative that you need a platform to do evaluation at scale. You do not. You need a framework that measures what matters, a pipeline that runs reliably, and results you can defend. The tools for that exist. They are free. They are here.
If this was useful, forward it to one engineer who needs less noise in their feed.


