The Agent Framework Wars: A Practitioner Scorecard
Six frameworks, five criteria that actually matter in production, and the honest verdict after a month of writing about them.
I have spent the last four weeks of this newsletter writing one tool spotlight after another on the agent frameworks that currently matter: AWS Strands, LangGraph, CrewAI, AutoGen 0.4, Pydantic AI, OpenAI’s Agents SDK, with a side trip through Semantic Kernel and the TypeScript camp for completeness. Every one of those posts was an attempt to be fair to the framework on its own terms. This one is not. This one is the head-to-head, in the voice of the person who has to pick one for a real team and live with the consequences. The criteria I care about are not “developer experience” or “community size” or any of the other variables that show up on vendor comparison pages. The criteria are debuggability, lock-in, type safety, observability, and team-shareability, because those are the five things that decide whether a framework choice quietly costs you a quarter of your engineering capacity eighteen months in.
Debuggability is the one I lead with because it is the one nobody scores against until it is too late. The honest test is not “can I attach a debugger to a Python process” but “when an agent run produces the wrong answer in production, how many minutes does it take a mid-level engineer on my team to find the cause.” LangGraph wins this category by a margin that surprised me when I went back through my notes from the month. The graph is explicit, the state transitions are inspectable, and when something goes wrong you can replay the run with the same inputs and watch it happen. CrewAI sits in the middle: the role-based abstraction makes the happy path easy to read and makes failure modes harder to localize, because the framework is doing more of the orchestration on your behalf. AutoGen 0.4 is better than the 0.2 reputation it inherits but still suffers from the multi-agent conversation pattern that makes “whose turn was it and why” a hard question to answer six steps deep. The OpenAI Agents SDK is the worst of the lot for debugging, not because the framework is broken but because the abstraction is thin and the failure mode is usually “the model made a choice you cannot inspect.”
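To make "explicit graph, inspectable transitions, replayable run" concrete, here is a minimal standard-library sketch. It is not LangGraph's API. The `execute` function, the `Run` record, and the two nodes are hypothetical names invented for illustration, and the replay is only exact because these stand-in nodes are deterministic. With a real model behind a node you would need to record and reuse its outputs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
# A node takes the current state and returns (new state, next node name or None).
Node = Callable[[State], Tuple[State, Optional[str]]]


@dataclass
class Run:
    """Recorded transitions for one run: (node name, state before, state after)."""
    steps: List[Tuple[str, State, State]] = field(default_factory=list)


def execute(graph: Dict[str, Node], start: str, inputs: State) -> Tuple[State, Run]:
    """Walk the graph from `start`, recording every state transition."""
    state: State = dict(inputs)
    current: Optional[str] = start
    run = Run()
    while current is not None:
        before = dict(state)
        # Hand the node a copy so the recorded "before" snapshot stays intact.
        state, next_node = graph[current](dict(state))
        run.steps.append((current, before, dict(state)))
        current = next_node
    return state, run


# Hypothetical nodes standing in for a model call and a response step.
def classify(state: State) -> Tuple[State, Optional[str]]:
    state["route"] = "refund" if "refund" in str(state["message"]) else "other"
    return state, "respond"


def respond(state: State) -> Tuple[State, Optional[str]]:
    state["reply"] = f"Routing to {state['route']} queue"
    return state, None


# The graph itself is plain data: node names mapped to callables.
GRAPH: Dict[str, Node] = {"classify": classify, "respond": respond}

if __name__ == "__main__":
    inputs = {"message": "I want a refund"}
    final, run = execute(GRAPH, "classify", inputs)
    for name, before, after in run.steps:
        print(name, before, "->", after)
    # Replaying with the same inputs reproduces the same transitions.
    _, rerun = execute(GRAPH, "classify", inputs)
    print("replay matches:", rerun.steps == run.steps)
```

The point of the sketch is the shape of the debugging session: the wrong answer is traced to a specific node by reading `run.steps`, rather than by guessing at what an orchestrator did on your behalf.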
Pydantic AI and Strands sit near the top of the pile alongside LangGraph, for different reasons. Pydantic AI is debuggable because the type contracts catch a category of error before it becomes a runtime mystery. If a structured output does not validate, you know exactly which field failed and why, which collapses a whole class of “the agent gave me garbage” investigations into a fixable error message. Strands is debuggable for a different reason: AWS chose to surface tracing as a first-class concern from the first commit, and the observability hooks are wired through every primitive, so the question of “what did the agent actually do” has a real answer without bolting on a third-party tool. The two of them solve different debugging problems, and a team that uses both for different services is doing the right thing.
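The "which field failed and why" idea can be sketched without any framework at all. What follows is a standard-library stand-in, not Pydantic's or Pydantic AI's API. The `RefundDecision` model, its fields, and the `validate` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class RefundDecision:
    # Hypothetical structured output an agent is expected to return.
    order_id: str
    approve: bool
    amount_cents: int


def validate(model: type, payload: Dict[str, Any]) -> List[str]:
    """Return one message per failing field; an empty list means the payload is valid."""
    errors: List[str] = []
    for f in fields(model):
        if f.name not in payload:
            errors.append(f"{f.name}: field required")
            continue
        value = payload[f.name]
        # Exact type match, so a bool is not silently accepted where an int is expected.
        if type(value) is not f.type:
            errors.append(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    # A model that answers "yes" instead of a boolean is caught at the boundary.
    bad_output = {"order_id": "A-1001", "approve": "yes", "amount_cents": 1250}
    for message in validate(RefundDecision, bad_output):
        print(message)
```

A real validation library does far more (coercion, nested models, constraints), but the debugging benefit is the same: the failure arrives as a named field and a reason instead of as a vague downstream symptom.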
Lock-in is the criterion the vendors least want you to think about, which is exactly why it should be the second filter. The OpenAI Agents SDK is the most locked-in framework on the list by a wide margin, and the lock-in is not about model choice, because every framework supports model swapping at this point. The lock-in is in the runtime primitives, the tracing model, the way handoffs are expressed, the assumption that you will use OpenAI’s evals product downstream. None of those are unreasonable choices on their own. They are unreasonable in aggregate, because the cost of leaving compounds across all of them at once. Strands has the same shape of risk wearing different colors: AWS-flavored conventions, AWS-flavored observability story, AWS-flavored deployment path. The framework is open source and the code will keep running if you walk away. The ecosystem around it will not.
LangGraph, CrewAI, AutoGen, and Pydantic AI are the four that score well on lock-in, and the reasons are worth distinguishing. LangGraph is portable because the graph is a data structure you own; the framework is mostly a runtime for executing it, and if you ever needed to rewrite that runtime you could do it in a week. CrewAI is portable because the role and task abstractions are simple enough to reimplement on top of anything else if you had to, and the team has not built a moat around any particular hosting story. AutoGen 0.4 is portable because Microsoft Research has historically published the protocol designs as research artifacts, which means the abstractions get documented at a level that survives the framework. Pydantic AI is portable in the strongest sense, because the type definitions are the contract and the framework is a thin layer around them; the structured outputs and tool schemas would survive a swap to anything else with two days of glue code.
Type safety is the criterion that splits the field in a way that has nothing to do with the Python-versus-TypeScript debate, although that is the way it usually gets framed. The real question is whether the framework treats the schema as a first-class concern or an afterthought. Pydantic AI wins this category by a margin that is almost embarrassing to the rest of the field, because the framework is built around the premise that structured outputs and tool definitions should be expressed in the same Pydantic models you already use for your API contracts. The end result is that a production AI service ends up with the same level of contract enforcement as a normal HTTP service, which is the thing that lets your CI actually catch regressions before they reach a customer. Mastra, on the TypeScript side, makes the same bet with Zod and earns the same advantage for teams whose backend is already TypeScript.
LangGraph, CrewAI, and AutoGen are usable but not strong here, in the sense that you can layer Pydantic over them if you choose to and most production teams eventually do, but the framework is not pulling you in that direction by default. Strands is better than I expected, with type hints threaded through the primitives in a way that suggests the AWS team learned from the Python community’s long argument about static typing. OpenAI’s Agents SDK is the weakest, which is a strange place for it to land given that OpenAI popularized the function-calling abstraction that made this whole conversation possible. The SDK gives you typed tools, but the broader runtime is loose, and the gap shows up most visibly in error handling, where untyped exceptions surface from places you did not expect.
Observability is where I came in with a strong prior and most of it held up after a month of writing about it. Langfuse, Arize Phoenix, and LangSmith are the three open or open-ish tools I covered earlier this month, and the honest summary is that the framework you pick determines how painful or pleasant integration with any of them ends up being. LangGraph is straightforward, because the graph already represents the structure the tracer wants to record. CrewAI has gotten meaningfully better here in the last two releases, with first-class hooks for emitting span data. Strands is genuinely excellent because the AWS team treated this as a launch requirement, which is rare enough to call out. Pydantic AI is good because the contracts already exist; you mostly need to wire them through. AutoGen and the OpenAI SDK are workable but more effort than they should be, which is a frustrating thing to discover the week before a production launch.
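For readers who have not wired a tracer before, this is roughly what "hooks for emitting span data" means. It is a standard-library sketch under stated assumptions, not the API of any framework or tracing tool named above. The `span` helper, the in-memory `SPANS` list, and `run_agent` are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Collected span records; a real exporter would ship these to a tracing backend.
SPANS: List[Dict[str, object]] = []
_open_spans: List[int] = []  # ids of currently open spans, innermost last


@contextmanager
def span(name: str, **attributes: object) -> Iterator[Dict[str, object]]:
    """Record one unit of agent work, linked to the enclosing span if any."""
    span_id = len(SPANS)
    record: Dict[str, object] = {
        "id": span_id,
        "parent": _open_spans[-1] if _open_spans else None,
        "name": name,
        "attributes": attributes,
        "status": "ok",
    }
    SPANS.append(record)
    _open_spans.append(span_id)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Keep the failure on the span, then let the exception propagate.
        record["status"] = f"error: {exc}"
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _open_spans.pop()


def run_agent(question: str) -> str:
    """Hypothetical agent run: one model step and one tool step under a root span."""
    with span("agent_run", question=question):
        with span("model_call", model="hypothetical-model"):
            plan = "lookup_orders"
        with span("tool_call", tool=plan):
            answer = "42 orders"
    return answer


if __name__ == "__main__":
    run_agent("How many orders shipped today?")
    for s in SPANS:
        print(s["id"], s["parent"], s["name"], s["status"])
```

The parent links are what a tracing UI turns into a tree. A framework whose execution model is already a graph or a typed call chain gives you natural places to open these spans, and a framework built on free-form conversation makes you decide where they go.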
Team-shareability is the last criterion and the one I find myself caring most about now that I have written this many posts. The question is whether a person who did not write the original agent can pick it up, understand what it does, change one thing, and ship that change in the same afternoon. CrewAI wins this in the broad case because the role-and-task abstraction reads like documentation; the cost is that the abstraction hides enough of the runtime that the same engineer cannot debug the failure modes I described earlier. LangGraph is the second-best, because the graph diagram and the code are close enough that the visualization works as onboarding material. Pydantic AI is excellent for teams that already think in types and a step harder for teams that do not. Strands is good if your team is comfortable with AWS conventions and worse if they are not. AutoGen is hardest in the team-handoff scenario because the multi-agent conversation pattern requires holding more state in your head than the other patterns do.
The verdict, which I would not state this directly outside of a piece like this, is that no single framework wins. There are three I would defend on a given engagement and three I would only pick under specific conditions. LangGraph, Pydantic AI, and Strands are the three I keep coming back to in real work, and the reason is the same in each case: they were built by people who took debuggability and observability as design requirements rather than feature requests. CrewAI is the right answer when the team you are optimizing for is composed of domain experts who need to express workflows without writing graph code. AutoGen is the right answer when the multi-agent conversation pattern is genuinely the shape of the problem, which is less often than people think. The OpenAI Agents SDK is the right answer when you have already committed to the OpenAI ecosystem end to end and the lock-in cost is one you have priced in honestly.
The framework choice is not the most important choice you will make about an agent system, and that is the part I want to land hardest. The model choice matters more, the evaluation harness matters more, the data layer matters more, the deployment posture matters more. The framework decides how those decisions get expressed in code. Pick the one that gets out of your way for the kind of system you are actually building, and budget the time to switch when you discover the system you thought you were building was the wrong one.
If this was useful, forward it to one engineer who needs less noise in their feed.


