Smolagents: What If Your Agent Just Wrote Python?
A thousand lines of library, a Python interpreter, and an honest answer to the question every tool-calling framework has been dancing around.
The tool-calling protocol that every major agent framework converged on is a hack. The model emits JSON, the framework parses it, the framework dispatches to a function, the function returns a value, the value gets stringified back into the prompt, and the model gets the next turn to decide what to do with it. We do this dozens of times per trajectory. We pay for it in tokens, in latency, in failure modes that are awkward to debug because the model is reasoning about side effects it cannot actually see. The whole structure exists because at some point in 2023 the labs decided the easiest way to let a model take an action was to make it produce a structured string that another system could trust, and the rest of the ecosystem treated that decision as the shape of the problem rather than one possible answer to it. Smolagents is the library that picked up the other answer and ran with it.
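To make the shape of that loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the two toy tools, the message format, and the scripted function standing in for the model. No real framework's wire protocol looks exactly like this, but the parse, dispatch, stringify cycle is the part that matters.

```python
import json

# Hypothetical tools for illustration; any plain functions would do.
def get_price(item: str) -> float:
    return {"apple": 0.5, "pear": 0.75}[item]

def multiply(a: float, b: float) -> float:
    return a * b

TOOLS = {"get_price": get_price, "multiply": multiply}

def scripted_model(history: list[str]) -> str:
    """Stand-in for an LLM: emits one JSON tool call per turn."""
    observations = sum(1 for m in history if m.startswith("observation:"))
    last = history[-1].removeprefix("observation: ")
    if observations == 0:
        return json.dumps({"tool": "get_price", "args": {"item": "apple"}})
    if observations == 1:
        # The model must copy the previous result back in as an argument.
        return json.dumps({"tool": "multiply", "args": {"a": float(last), "b": 12}})
    return json.dumps({"final": last})

def run_json_agent(task: str, max_turns: int = 5) -> tuple[str, int]:
    history = [f"task: {task}"]
    for turn in range(1, max_turns + 1):
        call = json.loads(scripted_model(history))     # parse
        if "final" in call:
            return call["final"], turn
        result = TOOLS[call["tool"]](**call["args"])   # dispatch
        history.append(f"observation: {result}")       # stringify into the prompt
    raise RuntimeError("ran out of turns")

answer, model_turns = run_json_agent("cost of 12 apples")
print(answer, model_turns)
```

Two tool calls cost three model turns here, and every intermediate value makes a round trip through a string.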
The pitch is short enough to state in a few sentences. The default agent in smolagents is a CodeAgent, and CodeAgent writes Python. Not JSON tool calls. Python. The model emits code, the runtime executes the code in a sandbox, the result of the execution comes back as a Python value, and the agent decides whether to keep going. Tools are exposed to the agent as Python functions it can call directly. Multi-step reasoning becomes a script. Branching becomes an if. Iteration becomes a for. The runtime is doing what runtimes do, and the model is doing what code-trained models are fluent at, which is writing code.
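A rough sketch of the code-action loop, under loud assumptions: this is not smolagents' implementation, the scripted function stands in for a model, the toy tool and the FinalAnswer exception are hypothetical names, and the plain exec call here is not a sandbox of any kind. It only shows how a loop and a branch fit inside a single action.

```python
import contextlib
import io

# Hypothetical tool, exposed to the generated code as an ordinary function.
def get_price(item: str) -> float:
    return {"apple": 0.5, "pear": 0.75}[item]

class FinalAnswer(Exception):
    """Raised by final_answer() to hand a value back to the loop."""
    def __init__(self, value):
        super().__init__(value)
        self.value = value

def final_answer(value):
    raise FinalAnswer(value)

def scripted_model(history: list[str]) -> str:
    """Stand-in for an LLM: returns one Python snippet as its action."""
    return (
        "total = 0\n"
        "for item, qty in [('apple', 12), ('pear', 4)]:\n"
        "    price = get_price(item)\n"
        "    if qty >= 10:\n"
        "        price *= 0.9  # bulk discount\n"
        "    total += price * qty\n"
        "final_answer(round(total, 2))\n"
    )

def run_code_agent(task: str, max_steps: int = 3):
    # Tools live in the namespace the generated code runs in.
    namespace = {"get_price": get_price, "final_answer": final_answer}
    history = [f"task: {task}"]
    for step in range(1, max_steps + 1):
        code = scripted_model(history)
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, namespace)  # plain exec: NOT a sandbox
        except FinalAnswer as done:
            return done.value, step
        # Printed output becomes the observation for the next step.
        history.append(f"observation: {stdout.getvalue()}")
    raise RuntimeError("ran out of steps")

answer, steps = run_code_agent("price the order, 10% off for 10 or more")
print(answer, steps)
```

The whole task, two lookups, a conditional discount, and a sum, finishes in one model step, and the intermediate values stay Python values instead of strings in a prompt.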
The current release is 1.26.0, dated May 29, and the project is now several versions past the initial Hugging Face announcement. The core agent logic in agents.py is still around a thousand lines of code, which is the kind of number you can verify by opening the file. That number matters because it is the load-bearing claim of the entire project. The library is small enough to read in an afternoon. You can fit the whole control loop in your head. When something goes wrong with your agent, the surface area you have to walk through to find the bug is bounded by a file you can scroll through end to end.
I have been through enough framework debugging sessions to know what this is worth. The LangChain-era abstraction stacks are not bad because the abstractions are wrong; they are bad because the abstractions hide the loop. A production failure in an agent built on a tower of base classes is a forensic exercise. You are stepping through callback hooks and runnable graphs and intermediate representations trying to find the layer where the model’s actual output became the framework’s interpretation of the model’s output, and by the time you find it you have spent an afternoon you cannot bill to anything. The smolagents bet is that the right amount of framework is the amount you can read in one sitting, and the right way to express agent actions is the language the model already speaks fluently.
The CodeAct pattern is not a smolagents invention. The paper that named it landed in 2024, and the empirical claim was that on multi-step reasoning benchmarks, agents that wrote code as their action representation outperformed agents that emitted JSON tool calls by a margin that was hard to dismiss. The intuition is mechanical: a model that has been trained on billions of lines of code has learned the grammar of expressing structured action in code. Forcing it to emit JSON with a tool name and an argument dictionary is asking it to translate from a language it knows into a language someone designed for the convenience of the dispatcher. Every layer of translation is a place the model can make a mistake the original language would not have permitted. The CodeAct result said the obvious thing out loud, and smolagents is the cleanest production-shaped implementation of it I have used.
The complication is the one anybody who has thought about this for ten minutes will ask immediately. Letting a model write and execute arbitrary Python is letting a model write and execute arbitrary Python. The failure mode is not theoretical. A model that can write code can write code that opens a network socket, that reads a credential, that touches a filesystem you did not mean it to touch. Every CodeAct system in production has to answer the sandbox question, and the honest answer is that there is no version of the answer where the sandbox is free. Smolagents supports E2B, Modal, Blaxel, and Docker as execution backends, plus a local interpreter for development. The recommended path for anything that runs against real data is one of the remote sandboxes, which means you are paying a per-execution cost and accepting a network hop on every action.
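To see why the sandbox question is hard, consider the cheapest possible defense: a static check on what the generated code imports. The sketch below is an assumption-laden illustration, not how smolagents or any of the backends named above work, and the allowlist is hypothetical. It is also not a sandbox. Calls like __import__ or attribute tricks walk straight past a check like this, which is the argument for real isolation.

```python
import ast

# Hypothetical allowlist for illustration only.
ALLOWED_MODULES = {"math", "statistics", "json"}

def disallowed_imports(code: str) -> list[str]:
    """Return the top-level modules a snippet imports that are not allowlisted.

    This is a pre-flight lint, not isolation: it cannot see dynamic imports.
    """
    blocked = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in ALLOWED_MODULES:
                blocked.append(root)
    return blocked

safe_action = "import math\nprint(math.sqrt(2))"
risky_action = (
    "import socket\n"
    "from os import environ\n"
    "print(environ.get('API_KEY'))\n"
)

print(disallowed_imports(safe_action))
print(sorted(disallowed_imports(risky_action)))
```

A check like this catches the honest mistakes. It does nothing about code that is trying to get out, which is why the remote backends carry their cost.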
This is the place where a different framework would hand-wave. Smolagents does not. The documentation is explicit that the local Python executor is for trusted code paths only, and the remote sandbox integrations are first-class because the team understands that a real sandbox is the only responsible way to run this architecture at scale. E2B is the most common pairing in the wild, judging by the example code in the repo and the blog posts the Hugging Face team has put out. The integration is small enough to read, the cost model is straightforward, and the latency profile is what you would expect from a cold-start container. None of this is hidden. The team is not selling the pattern as free; they are selling it as worth the price.
The other piece worth naming is the model-agnosticism. Smolagents will run against any inference endpoint that speaks the relevant API: Hugging Face Inference, OpenAI, Anthropic, local Ollama, anything you can wrap in a thin adapter. This matters for the same reason it matters in every other framework: the model that is the right choice for your agent in June is not the model that is the right choice in October, and a library that locks you to a vendor is a library that ages out the moment the leaderboard moves. Letta solved this by abstracting the agent runtime away from the model. Smolagents solved it by making the model a parameter and the runtime a Python interpreter, and the result is that swapping from Claude Sonnet to a local Qwen model is a one-line change in the agent constructor.
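Roughly what that swap looks like, as a sketch rather than a recipe. It assumes smolagents is installed with its LiteLLM extra and uses the CodeAgent and LiteLLMModel classes; the model identifiers and the local endpoint are illustrative placeholders, so check them against your provider and the current smolagents documentation before relying on them.

```python
# Illustrative identifiers only; substitute whatever your provider exposes.
MODEL_CHOICES = {
    "hosted": {"model_id": "anthropic/claude-3-5-sonnet-latest"},
    "local": {
        "model_id": "ollama_chat/qwen2.5-coder:7b",
        "api_base": "http://localhost:11434",  # assumed default Ollama address
    },
}

def build_agent(choice: str):
    """Build a CodeAgent for one of the configured models."""
    # Imported lazily so this module loads even without smolagents installed.
    from smolagents import CodeAgent, LiteLLMModel

    model = LiteLLMModel(**MODEL_CHOICES[choice])
    # The model is just a constructor argument; the loop does not change.
    return CodeAgent(tools=[], model=model)

# Swapping models is a change to this one argument:
# agent = build_agent("hosted")
# agent = build_agent("local")
# agent.run("How many seconds are there in a leap year?")
```

The calls at the bottom are commented out because they need credentials or a running local server.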
The turn, for me, is that this is the framework I now reach for when I want to prototype an agent without committing to a stack. Letta is the answer when memory is the first-class concern and the agent has to survive a restart. Smolagents is the answer when the agent has to think through a problem that decomposes naturally into code: parsing a messy dataset, walking a directory tree, calling three APIs in sequence and reconciling the results, running a quick numerical check before deciding what to do next. The default CodeAgent gives you that for free, and the library is small enough that when I want to change the loop, I can change the loop, instead of subclassing my way through three abstraction layers to override behavior the framework never intended to expose.
The limitations are real and they are the same limitations CodeAct has everywhere. Token costs are higher per action than a JSON tool call would be, because the model is generating syntactically valid Python rather than a short structured payload. Failure modes shift from “the model emitted invalid JSON the parser rejected” to “the model emitted syntactically valid Python that did the wrong thing,” which is a different debugging problem and not obviously an easier one. The sandbox is mandatory for any serious deployment, and that adds operational surface area you did not have before. None of these are reasons to skip the framework; they are reasons to know what you are signing up for when you adopt it.
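The shift in failure modes is easy to show in miniature. This is a toy illustration with made-up payloads, not output from any real model: the malformed JSON fails loudly at the parser, while the buggy Python parses, runs, and quietly produces the wrong number.

```python
import ast
import json

# A JSON tool call with a missing closing brace: rejected before anything runs.
bad_json_action = '{"tool": "multiply", "args": {"a": 0.5, "b": 12}'

def json_is_wellformed(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# A code action that is valid Python but divides where it should multiply.
buggy_code_action = "total = price / qty"

def python_is_wellformed(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

namespace = {"price": 0.5, "qty": 12}
exec(buggy_code_action, namespace)  # runs without any error

expected_total = 0.5 * 12
print(json_is_wellformed(bad_json_action))       # parser catches this one
print(python_is_wellformed(buggy_code_action))   # nothing catches this one
print(namespace["total"] == expected_total)
```

No syntax check flags the second case. Only an assertion, a test, or a human reading the trace will.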
What smolagents has done is force the question that every agent team should have been asking from the beginning. If the model is good at writing code, why is the action representation not code? If the action representation is code, why is the framework not built around the execution of code? If the framework is built around the execution of code, why is it not small enough to read? Hugging Face shipped a thousand-line answer to all three questions in a library you can install in one command, and the rest of the ecosystem is going to have to decide whether to follow the architecture or keep dispatching JSON until the cost gets embarrassing.
If you are building an agent right now and you have not at least read agents.py end to end, that is the afternoon I would spend this week.


