HUD
The Evals Tool Built for Agents, Not Text
Most evaluation tools score text. HUD gives your agent a shell, a browser, a desktop, and a robot, then tells you whether it actually worked.
Most LLM evaluation frameworks ask the same question: how good is the output? They compare generated text against reference answers, run LLM-as-a-judge checks for hallucination, and compute ROUGE, BLEU, and answer relevance. These metrics are useful for chatbots and Q&A systems. They tell you almost nothing about whether your agent successfully completed a multi-step task in a real environment. An agent that produces a perfect plan but cannot navigate a filesystem to execute it is not a working agent. An agent that generates elegant shell commands but fails to recover when a directory does not exist is not a working agent. The evaluation surface for agents is not the text they produce. It is the environments they operate in and whether they achieve their objectives there. That distinction sounds obvious when you write it down. Almost nobody’s evaluation stack reflects it.
HUD is built around that distinction. It is an MIT-licensed Python library and platform from a small team shipping out of an even smaller GitHub org: 264 stars, 59 forks, and a Discord server with fewer members than most AI influencer group chats. It does not have the brand recognition of Langfuse or the pytest integration that made DeepEval famous. What it has is a protocol that treats agent evaluation as the problem of defining environments, capabilities, and tasks, then running agents through them and collecting rewards. This is not a framework for scoring text outputs with a few extra agent metrics bolted on. This is a framework built from the ground up around the idea that agents do things in the world, and evaluating them means watching them do those things.
The architecture is protocol-first. An environment and an agent exchange exactly three things: a manifest that declares the environment’s capabilities and available tasks, a tasks.start call that returns the prompt, and a tasks.grade call that returns the reward. Everything that happens in between is the agent driving the environment’s capabilities directly. HUD owns only that thin envelope. The protocol does not prescribe what an agent looks like or which model powers it. It only defines the contract: here is what the environment can do, here is what the task asks, here is how the task gets scored. Any model or harness that speaks that contract can run against any HUD environment. The evaluation definition outlives any single agent implementation. That is not a design detail. That is the architecture decision that separates an eval platform from an eval script.
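To make the three-message envelope concrete, here is a minimal in-memory sketch. It is not HUD's code: ToyEnvironment, run_episode, and the method names tasks_start and tasks_grade are hypothetical stand-ins chosen to mirror the manifest, tasks.start, and tasks.grade calls described above, and the wire format is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """In-memory stand-in for an environment (hypothetical, not a HUD class)."""
    capabilities: list = field(default_factory=list)
    tasks: dict = field(default_factory=dict)  # task name -> (prompt, expected answer)

    def manifest(self) -> dict:
        # 1. Declare what the environment can do and which tasks it offers.
        return {"capabilities": list(self.capabilities), "tasks": sorted(self.tasks)}

    def tasks_start(self, name: str) -> str:
        # 2. Starting a task returns the prompt the agent works from.
        prompt, _ = self.tasks[name]
        return prompt

    def tasks_grade(self, name: str, answer: str) -> float:
        # 3. Grading returns the reward for what the agent produced.
        _, expected = self.tasks[name]
        return 1.0 if expected in answer else 0.0


def run_episode(env: ToyEnvironment, agent, task_name: str) -> float:
    """Drive one task: start, let the agent act, grade."""
    prompt = env.tasks_start(task_name)
    answer = agent(prompt)  # the agent can be any callable; the environment does not care
    return env.tasks_grade(task_name, answer)


env = ToyEnvironment(tasks={"count-r": ("How many 'r's are in 'strawberry'?", "3")})
reward = run_episode(env, lambda prompt: "There are 3.", "count-r")
print(env.manifest(), reward)
```

The agent here is a plain callable on purpose. Swapping it for a different model or harness changes nothing on the environment side, which is the point of keeping the envelope this thin.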
The capabilities are where HUD earns its claim on agent evaluation specifically. An environment can expose five protocol-level capabilities: ssh for shell and filesystem access in a sandboxed workspace, mcp for tools over the Model Context Protocol, cdp for browser control over the Chrome DevTools Protocol, rfb for full computer use over VNC with screen capture and keyboard/mouse input, and robot for schema-driven observation and action loops over WebSocket. These are not abstract evaluation dimensions. These are real interfaces to real execution environments. An agent that needs to fix a bug in a codebase gets an ssh capability with a workspace root. An agent that needs to navigate a web application gets a cdp capability with a real browser. An agent that needs to operate a GUI gets an rfb capability with a real desktop. The evaluation is not asking whether the agent’s answer sounds right. It is asking whether the agent actually did the thing.
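One way to picture the capability surface is as a small fixed vocabulary that tasks require and environments expose. The sketch below assumes that framing. The Capability enum and missing_capabilities helper are illustrative names, not part of HUD, and only the five capability strings come from the description above.

```python
from enum import Enum


class Capability(Enum):
    SSH = "ssh"      # shell and filesystem access in a sandboxed workspace
    MCP = "mcp"      # tools over the Model Context Protocol
    CDP = "cdp"      # browser control over the Chrome DevTools Protocol
    RFB = "rfb"      # computer use over VNC
    ROBOT = "robot"  # observation and action loops over WebSocket


def missing_capabilities(required, exposed):
    """Return the required capabilities the environment does not expose, sorted by name."""
    return sorted(set(required) - set(exposed), key=lambda c: c.value)


# Hypothetical environment that offers a shell and a browser, nothing else.
exposed = {Capability.SSH, Capability.CDP}

print(missing_capabilities({Capability.SSH}, exposed))                  # bug-fix task: runnable
print(missing_capabilities({Capability.SSH, Capability.RFB}, exposed))  # GUI task: not runnable here
```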
The template system is the other half of the design that makes HUD practical. A template is an async generator decorated with @env.template(). You yield a prompt, receive the agent’s answer, then yield a reward. One function spans a whole dataset of variants. The simplest example needs no capabilities at all, only a prompt and a grader. A letter-counting task yields “How many ‘r’s are in ‘strawberry’?” and scores 1.0 if the answer contains the right number. Parameterize the word and you have a dataset. Add an ssh capability to the environment and the same template pattern extends to “fix the bug in src/auth.py where login fails for users with special characters in their password.” The grader checks whether the test suite passes after the agent’s changes. The template abstraction does not change between a text-only task and a full shell-plus-browser task. That uniformity is the thing that makes HUD work as an evaluation platform rather than a collection of scripts. You define tasks the same way regardless of capability surface. You run them the same way. You collect rewards the same way.
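The yield-receive-yield shape is easier to see in plain Python. The sketch below leaves out the @env.template() decorator so it runs without the library installed. run_template is a hypothetical driver standing in for whatever actually advances the generator, and the fixed-answer agent is a placeholder.

```python
import asyncio


async def count_letter(word: str, letter: str):
    """Template-shaped async generator: yield a prompt, receive an answer, yield a reward."""
    answer = yield f"How many '{letter}'s are in '{word}'?"
    expected = str(word.count(letter))
    yield 1.0 if expected in answer else 0.0


async def run_template(template, agent):
    """Hypothetical driver: advance to the prompt, send the answer in, read the reward."""
    prompt = await template.__anext__()    # runs until the first yield (the prompt)
    answer = agent(prompt)
    reward = await template.asend(answer)  # resumes the generator with the answer
    await template.aclose()
    return prompt, reward


# Parameterizing the template turns one function into a small dataset.
variants = [("strawberry", "r"), ("mississippi", "s")]
results = []
for word, letter in variants:
    prompt, reward = asyncio.run(
        run_template(count_letter(word, letter), lambda p: "I count 3.")
    )
    results.append(reward)
    print(prompt, reward)
```

The placeholder agent always answers 3, so it is rewarded on the first variant and not on the second. A substring check like this is a deliberately crude grader. A shell-backed task would replace that one line with a test-suite run while the generator shape stays the same.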
The version that shipped today, v0.6.3, lands on top of a v0.6.0 release from yesterday that rewrote the protocol layer. The v0.6.0 change is worth understanding because it clarifies what HUD is and is not trying to be. Before v0.6.0, environments carried some agent-tool wiring internally. After v0.6.0, environments expose only a thin control channel with capabilities, and agent harnesses own the tools entirely. An ssh capability means the environment provides shell and files. It does not provide a “run command” tool. The harness attached to the agent supplies that tool by wiring it to the capability. This is a separation that most frameworks blur. By keeping the environment unaware of the agent’s tool definitions, HUD ensures that the same environment works with a Claude harness that defines tools one way, a GPT harness that defines them another, and a custom harness you write yourself for a model the platform does not natively support. The environment is the fixed point. The agent is the variable. The evaluation is the constant.
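The split between capability and tool can be sketched with two toy harnesses sharing one workspace. Everything here is hypothetical: InMemoryWorkspace stands in for a shell-and-files capability, and the two tool-spec shapes are only loosely modeled on how different vendors describe tools, not copied from any vendor schema or from HUD.

```python
class InMemoryWorkspace:
    """Stand-in for an ssh-like capability: it offers files, not tool definitions."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, path):
        return self.files[path]


class HarnessA:
    """Owns its own tool spec and maps tool calls onto the capability."""

    def __init__(self, workspace):
        self.workspace = workspace
        self.tools = [{"name": "read_file", "input_schema": {"path": "string"}}]

    def call(self, name, arguments):
        if name == "read_file":
            return self.workspace.read(arguments["path"])
        raise KeyError(name)


class HarnessB:
    """A second harness with a differently shaped spec for the same capability."""

    def __init__(self, workspace):
        self.workspace = workspace
        self.tools = [
            {"type": "function", "function": {"name": "cat", "parameters": {"file": "string"}}}
        ]

    def call(self, name, arguments):
        if name == "cat":
            return self.workspace.read(arguments["file"])
        raise KeyError(name)


workspace = InMemoryWorkspace({"src/auth.py": "def login(): ...\n"})
a = HarnessA(workspace).call("read_file", {"path": "src/auth.py"})
b = HarnessB(workspace).call("cat", {"file": "src/auth.py"})
print(a == b)
```

The workspace class never mentions a tool name or schema. Both harnesses reach the same file through different tool definitions, which is the property the v0.6.0 change is described as enforcing.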
The native agent support covers Claude, OpenAI Responses, OpenAI-compatible endpoints, and Gemini via create_agent("claude-sonnet-4-5") or the equivalent model string. The harness wires capability-backed tools for the model you pick at runtime. If you want to wrap browser-use on cdp or plug in a custom VLA policy on robot, you write a harness that attaches to the capability and defines a tool spec. No protocol work is required. The environment does not know or care which harness connected to it. This is the cleanest separation of environment and agent in the open-source evaluation ecosystem. It is also, at 264 stars, almost completely unknown.
The training integration is where HUD departs from the evaluation-only platforms entirely. Every rollout returns a Run carrying a trace ID and a reward. The same tasks you evaluate on are already training data. Run a group of 16 rollouts per task and compute GRPO advantages with group_relative(), normalizing by the standard deviation across the group. Feed the trace IDs and advantages into your optimizer. HUD is the environment-and-reward source for your own GRPO or PPO loop. The same environment trains any model, text or multimodal, without modification. This is not an evaluation tool that happens to export data you could use for training if you built the pipeline yourself. This is an evaluation platform designed with RL training as a first-class downstream use case. The docs have a dedicated section called “Designing tasks for signal” that covers exactly this: how to structure your task definitions so the rewards carry information an optimizer can actually use. Most eval platforms do not have that section because they were not built to feed optimizers. HUD was.
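For readers who have not seen it written out, the group-relative advantage at the heart of GRPO is just each reward standardized against its own group. The function below is a generic sketch of that arithmetic, not HUD's group_relative() implementation. The name group_relative_advantages, the epsilon term, and the choice of population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a real implementation may choose differently
    # eps keeps the division defined when every rollout in the group scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 16 rollouts on one task: 4 successes, 12 failures.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)
print(advantages[0], advantages[-1])
```

Successes get a positive advantage and failures a negative one, and the values sum to roughly zero across the group. A group where every rollout earns the same reward produces all-zero advantages, so a task that is always solved or never solved gives the optimizer nothing to push on.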
The platform piece, the hosted service at hud.ai, handles batch runs, model comparison on the same taskset, and trace inspection. It is not required. You can run everything locally with hud build against a Docker container. The CLI is clean. hud init my-env scaffolds a project. hud deploy builds and registers your environment on the platform. hud sync tasks my-taskset pushes a taskset. hud eval my-taskset --remote runs it against the platform. The local path uses the same protocol against a container on your laptop. This is the right split. The open-source library is fully functional without the platform. The platform adds orchestration, comparison, and leaderboards. You are not evaluating a SaaS with an open-source demo tier. You are evaluating an open-source library with an optional hosted service.
The tradeoffs are real and worth naming. HUD is young. The GitHub org has a handful of contributors. The documentation at docs.hud.ai is thorough for the core protocol, capabilities, and task definition, but the training section is thinner and the robot capability is still marked beta. The community is small enough that if you run into a capability bug on a Friday evening, you are probably the first person to hit it. This is not a tool with a thousand open issues and a dedicated support team. It is a tool where the architecture is excellent and the bus factor is real. The other side of that tradeoff is that the protocol separation means your environments are not locked to HUD’s agent implementations. If the project goes quiet, your task definitions and environment images still work. Any harness that speaks the same thin protocol can drive them. That is a better failure mode than most evaluation platforms offer.
The thing HUD gets right that almost nobody else does is that agent evaluation is not a harder version of text evaluation. It is a different category of problem. Text evaluation asks whether the output matches the reference. Agent evaluation asks whether the agent achieved the objective in the environment. The first question needs a grader. The second question needs an environment, capabilities, a task definition, and a reward function, and the grader is the last step in a chain that starts with giving the agent a real shell, a real browser, or a real desktop to work in. HUD builds the chain. Most evaluation tools build only the last link and call it done. That is the difference between scoring text and evaluating agents. HUD is the only MIT-licensed tool in the ecosystem that starts from the second question and works backward. At 264 stars, it is also the most underweighted pick in the entire evaluation stack.


