Code-as-Action Is the Tool-Calling Pattern You're Underweighting
The JSON tool call won because it was the easy thing to ship, not because it was the right shape for the problem.
The default action representation for AI agents is a JSON object. The model writes a name and an argument dictionary, the framework parses the object, the framework dispatches to a registered function, the function runs, the return value gets stringified back into the prompt, and the model gets one more turn to figure out what to do with the result. We have built the entire agent ecosystem on top of this pattern and most of us have stopped asking whether it was the right pattern to begin with. It was not. It was the easiest pattern to ship in 2023, when OpenAI shipped function calling and the rest of the labs followed within a quarter, and the agent frameworks treated that decision as the shape of the problem rather than one possible answer to it. Code-as-action is the other answer, the CodeAct paper from 2024 said it out loud, and a year later the empirical case is solid enough that the framework choice you are making in 2026 should not default to JSON without a real reason.
The argument starts with what the model is actually good at. Frontier models have seen more Python than they have seen of almost any other structured language. They have been post-trained on code, they have been benchmarked on code, and the capability lifts of the last two years were largely lifts on code-heavy evals. Asking a model that has been trained to fluency in Python to express its next action as a JSON object with a tool name and an argument dictionary is asking it to translate. Every translation is a place an error can happen that the source language would not have permitted. A model that wants to call three tools in sequence and reconcile the results has to emit three separate JSON payloads, get three separate dispatches, get three separate return values back into context, and reason about each step in isolation because the JSON protocol does not have a control flow primitive. The same model writing Python has a for loop and a list comprehension. It is not a contest.
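A minimal sketch of that contrast, under stated assumptions: `get_price`, the SKU table, and the `dispatch` helper are hypothetical stand-ins, not any framework's API, and the bare `exec` call here is only for illustration and is not a sandbox.

```python
import json

# Hypothetical tool: a stand-in for a real API client.
def get_price(sku: str) -> float:
    return {"A1": 10.0, "B2": 25.5, "C3": 4.5}[sku]

TOOLS = {"get_price": get_price}

def dispatch(payload: str):
    """JSON-dispatch style: parse one call, run one registered function."""
    call = json.loads(payload)
    return TOOLS[call["name"]](**call["arguments"])

# JSON style: three payloads, each one a separate model turn.
# Summing the results would need yet another turn.
json_turns = [
    '{"name": "get_price", "arguments": {"sku": "A1"}}',
    '{"name": "get_price", "arguments": {"sku": "B2"}}',
    '{"name": "get_price", "arguments": {"sku": "C3"}}',
]
json_results = [dispatch(p) for p in json_turns]

# Code-as-action style: one action that loops and reconciles in place.
code_action = """
prices = [get_price(sku) for sku in ["A1", "B2", "C3"]]
total = sum(prices)
"""
namespace = {"get_price": get_price}
exec(code_action, namespace)  # illustration only; real use needs a sandbox
code_total = namespace["total"]
```

The JSON path produces three isolated values that still have to be reconciled in a later turn, while the code path returns the reconciled total from a single action.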
The CodeAct paper from Wang et al. made the empirical case in early 2024. Agents that emitted code as their action representation outperformed JSON-tool-calling agents on multi-step reasoning benchmarks by margins that were large enough to be uncomfortable for anyone with a JSON-shaped agent in production. The paper reported up to a twenty-percent higher success rate on complex tasks, with the gap widening on problems that required more steps. A year and change of follow-on work has not overturned the finding. Hugging Face’s smolagents library shipped a clean implementation as the default agent type. Anthropic’s own work on computer use has the model emitting code-shaped action sequences. The labs that are not yet publicly committed to the pattern are running the experiments internally. The center of gravity in agent research moved while the production frameworks were still optimizing for the previous answer.
The complication is what every engineer who has thought about this for ten minutes asks immediately. Letting a model write and execute arbitrary Python is, in fact, letting a model write and execute arbitrary Python. The failure mode is not theoretical. A model that can write code can write code that opens a network socket, exfiltrates a credential from an environment variable, touches a filesystem path nobody meant for it to touch, or calls a billable API in a loop until the bill is in five figures. There is no version of code-as-action where the sandbox is free. Every team that adopts the pattern is signing up for a sandboxing decision and a per-execution cost the JSON-dispatch architecture did not impose.
The honest version of the answer is that the sandbox question has stopped being open. E2B will give you a remote Python sandbox with a sub-second cold start and an API that fits in a paragraph. Modal does the same with a different cost model and a different deployment story. Daytona, Blaxel, and a handful of others occupy the same space. Docker is acceptable for local development and for any environment where the network egress can be policy-controlled at the host level. The cost model is straightforward, the latency profile is predictable, and the failure modes are operational rather than mysterious. The sandbox is no longer the part of the architecture you have to invent. The part you have to invent is the policy around what the sandbox is allowed to reach, and that policy lives at the boundary of the sandbox rather than inside the agent loop, which is where security policy belongs in any system.
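As a rough sketch of what policy at the sandbox boundary can look like with plain Docker, the helper below builds a `docker run` command with egress disabled by default. The function names, the image tag, and the resource limits are assumptions chosen for illustration; the flags themselves are standard `docker run` options.

```python
import subprocess

def sandbox_command(code: str,
                    image: str = "python:3.12-slim",  # assumed image tag
                    allow_network: bool = False,
                    memory: str = "256m",
                    cpus: str = "1.0") -> list[str]:
    """Build a docker invocation that runs agent-written code under limits."""
    cmd = [
        "docker", "run", "--rm",
        "--read-only",          # container root filesystem is read-only
        "--memory", memory,     # cap memory
        "--cpus", cpus,         # cap CPU
    ]
    if not allow_network:
        cmd += ["--network", "none"]  # no egress unless explicitly allowed
    cmd += [image, "python", "-c", code]
    return cmd

def run_in_sandbox(code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Hypothetical runner; requires a local Docker daemon."""
    return subprocess.run(sandbox_command(code), capture_output=True,
                          text=True, timeout=timeout_s)

cmd = sandbox_command("print(2 + 2)")
```

The agent loop never sees these flags. The policy is decided where the container is launched, which is the point of keeping it at the boundary.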
The other complication worth naming is observability. A JSON tool call is easy to log. The tool name and argument dictionary are structured, the dispatcher knows what was called and with what arguments, the return value is bounded in shape. A CodeAct trajectory is a sequence of Python programs the agent wrote, each of which can do anything Python can do. Reproducing a failure means capturing the code, the input state, and the sandbox environment, and replaying all three. The tooling for this is younger than the tooling for JSON-tool-call observability. Langfuse, Phoenix, and the OpenTelemetry agent semantic conventions are all making progress, but the maturity gap is real and a team adopting CodeAct should expect to spend a week of engineering effort building the trace capture they would have gotten for free from a JSON-dispatch framework.
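A minimal sketch of that trace capture, assuming JSON-serializable inputs and an in-process `exec` standing in for the real sandbox call. The `run_and_trace` name and the record fields are hypothetical, not the schema of Langfuse, Phoenix, or OpenTelemetry.

```python
import contextlib
import hashlib
import io
import json
import platform
import sys
import traceback

def run_and_trace(code: str, inputs: dict) -> dict:
    """Run one code action and return a record with what replay needs."""
    record = {
        "code": code,
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
        # Snapshot inputs before execution so later mutation is not logged.
        "inputs": json.loads(json.dumps(inputs)),
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
    }
    namespace = dict(inputs)
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # stand-in for the sandbox call
        record["error"] = None
    except Exception:
        record["error"] = traceback.format_exc()
    record["stdout"] = stdout.getvalue()
    return record

rec = run_and_trace("print(sum(values))", {"values": [1, 2, 3]})
# Replaying from the record alone uses only the captured code and inputs.
replayed = run_and_trace(rec["code"], rec["inputs"])
```

Replaying from the stored record rather than from live state is what makes a failure reproducible after the fact. A production version would also pin the sandbox image and installed package versions.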
The turn, after a year of running both architectures against real problems, is that the CodeAct cost is paid in operational complexity and the JSON cost is paid in agent capability. Operational complexity is a problem you can throw engineering at. Agent capability is a problem you can only throw a better model at, and the better model is the one that is going to be even more fluent in code. Every cycle of the model improvement curve makes the JSON-dispatch architecture relatively weaker, because the gains the labs are shipping are concentrated in the action representation the JSON dispatcher is making the model translate out of. Choosing JSON in 2026 is choosing to give up the next two years of capability lifts at the action layer because you do not want to run a sandbox today. That is a defensible decision in some contexts. It is not defensible as a default.
The contexts where JSON still wins are narrow and worth naming. Customer-facing agents with a small, fixed tool set and hard latency budgets, where every action is a database read or a CRM update and the action surface is closed by design, are fine with JSON. The translation cost is small, the capability ceiling does not bind, and the sandbox would be operational overhead for no benefit. Compliance-heavy workflows where every action has to be auditable against a pre-approved list of operations are the same: the JSON contract is the audit trail, and CodeAct’s openness is the wrong shape for the problem. The places where CodeAct dominates are the places agents are actually being asked to do interesting work. Parsing a messy dataset, walking a directory tree, calling three APIs and reconciling the results, running a quick numerical check before deciding what to do next, debugging a system by trying things and observing the responses. These are the problems where the JSON protocol’s lack of control flow shows up as a twenty-percent gap in success rate, and these are the problems your roadmap probably has more of than it has CRM-update agents.
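For the compliance case, a small sketch of why the JSON contract doubles as the audit trail: every call is checked against a pre-approved list before it runs, and the validated call is the log entry. The operation names, handlers, and `audited_dispatch` helper are hypothetical examples, not any particular framework's interface.

```python
import json

# Hypothetical pre-approved operations and their permitted argument names.
APPROVED = {
    "read_account": {"account_id"},
    "update_crm_note": {"account_id", "note"},
}

audit_log: list[dict] = []

def audited_dispatch(payload: str, handlers: dict):
    """Validate a JSON tool call against the allow-list, log it, then run it."""
    call = json.loads(payload)
    name, args = call.get("name"), call.get("arguments", {})
    if name not in APPROVED:
        raise PermissionError(f"operation not approved: {name!r}")
    unexpected = set(args) - APPROVED[name]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    audit_log.append({"name": name, "arguments": args})
    return handlers[name](**args)

# Hypothetical handlers standing in for a database read and a CRM write.
handlers = {
    "read_account": lambda account_id: {"id": account_id, "status": "active"},
    "update_crm_note": lambda account_id, note: True,
}

result = audited_dispatch(
    '{"name": "read_account", "arguments": {"account_id": "42"}}', handlers
)
```

Nothing outside the list can execute, and the log is a complete record of what did. An open-ended code action cannot offer that property without extra machinery.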
The team that is still deciding which agent framework to standardize on in the next quarter has a real choice to make and the choice is not between LangGraph and smolagents and Letta and Atomic Agents as products. The choice is between betting on JSON-dispatch as the action representation that will carry the next two years of work, or betting on code-as-action and accepting the sandbox tax as the price of the better architecture. The empirical work points one way, the lab research direction points the same way, and the operational tooling has matured enough that the bet is not the stretch it would have been a year ago.
If you are still defaulting to JSON tool calls because that is what the SDK example showed, that is the decision to revisit this quarter.