What May 2026 Told Us About the AI Tools That Actually Ship
Thirty days, thirty posts, and the patterns that survived contact with real work.
I wrote thirty articles this month about the tools my teams actually touch when we build AI systems for paying customers, and the most useful thing I can do on the last day of May is tell you which patterns survived the writing and which ones quietly fell apart on the page. I started the month thinking the story was going to be about agent frameworks, because that is where the noise is loudest and where the largest number of vendors are trying to take your attention. The story turned out to be about something else: the framework layer is the least interesting decision in an AI stack right now, and the interesting decisions are happening one layer up and one layer down from where everyone is looking.
The first week was the framework spotlight arc: Strands, LangGraph, CrewAI, AutoGen, Pydantic AI, the OpenAI Agents SDK, with the Mastra and Semantic Kernel honorable mentions to round it out. I went in with the prior that the field was overcrowded and came out with the same prior reinforced. The frameworks that survived my own scrutiny were the ones built by people who took debuggability and observability as design requirements from the first commit, not features added in the second year of the project. That filter eliminated a startling amount of the field. The frameworks that earned their place on a real engagement, in my notes from this month, were LangGraph, Pydantic AI, and Strands. The others have their cases. None of them is the case I would defend in a Monday morning standup with a CTO asking why we picked what we picked.
The second week was the eval arc, and that is the week the through-line of the month actually surfaced. Promptfoo, Braintrust, LangSmith, Langfuse, and Arize Phoenix all got their day, and what became obvious by the end of the week is that the field has not yet decided whether evaluation is a development-time activity or a runtime activity. The vendors who treat it as a runtime concern, which is most of them, end up building observability tools that happen to score outputs. The vendors who treat it as a development-time concern, like Promptfoo and increasingly Braintrust, end up building something closer to a test suite for non-deterministic systems. The two camps are not in competition, even though they look like they are on the landing pages. They are solving different problems for the same team. A production AI system needs both, and the teams that figured this out first are the teams that are shipping the most reliably right now. Most teams have not figured it out, which is why the eval gap is the largest open problem on my list heading into June.
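To make the "test suite for non-deterministic systems" idea concrete, here is a minimal sketch of what a development-time eval can look like. It is not how Promptfoo or Braintrust work internally; the names EvalCase, pass_rate, and make_stub_model are my own, and the stub stands in for a real model call. The point it illustrates is that the assertion is on a pass rate over repeated trials rather than on one exact output.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when an output is acceptable


def pass_rate(model: Callable[[str], str], case: EvalCase, trials: int) -> float:
    """Run one case several times and return the fraction of acceptable outputs."""
    passed = sum(case.check(model(case.prompt)) for _ in range(trials))
    return passed / trials


def make_stub_model(seed: int, failure_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model call; answers badly some of the time."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        if rng.random() < failure_rate:
            return "I am not sure."
        return "The capital is Paris."

    return model


case = EvalCase(
    prompt="What is the capital of France?",
    check=lambda output: "paris" in output.lower(),
)

reliable = make_stub_model(seed=0, failure_rate=0.0)
broken = make_stub_model(seed=0, failure_rate=1.0)

# The gate is a rate over many trials, not a match against a single string.
THRESHOLD = 0.9  # assumed value; tune per case in a real suite
assert pass_rate(reliable, case, trials=20) >= THRESHOLD
assert pass_rate(broken, case, trials=20) < THRESHOLD
```

The runtime camp would take the same check function and apply it to sampled production traffic instead of a fixed case list, which is why the two approaches end up complementary.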
The third week was the older-but-relevant arc, and that one mattered for a reason I did not see coming when I sketched the calendar. Haystack, LlamaIndex, Semantic Kernel, DSPy, and the RAG-to-agent evolution take on the closing day all served the same purpose in retrospect: they were the reminder that the agent ecosystem did not start in 2024, and the tools that have been quietly maturing for three or four years are often the ones with the production stories the new entrants do not have yet. Haystack runs in places where the new frameworks will not survive a procurement review. LlamaIndex has been doing agent work under a different label for longer than most of its competitors have existed. DSPy is the only tool in the field treating prompt construction as a compiler problem rather than a string-formatting exercise, and that bet is going to look smarter in eighteen months than it does today. The lesson of the week, which I want to underline because it is genuinely contrarian in the current moment, is that vendor age is not a liability in this field. Vendor age is the closest thing we have to a proxy for production hardening.
The fourth week was the personal arc, and that is the one I felt the most uncertain about heading into the month and the most confident about coming out of. The arc was about built-in tools, Claude Skills, the skills-as-npm-packages pattern, and the Composio interrogation that closed it out. The argument that emerged across the seven posts, which I want to state directly because it is the argument I will be making for the rest of the year, is that the agent ecosystem is currently confusing two different distribution problems and treating them as one problem. The first distribution problem is how a developer ships a capability to an agent runtime. The second distribution problem is how an organization governs which capabilities its agents are allowed to use. The current generation of registry-style products, Composio being the cleanest example, is solving the second problem by pretending it is the first. That works until you try to put an agent through a SOC 2 audit, at which point the registry model collapses into something the auditor cannot understand and the security team cannot underwrite. The package model, where skills ship as npm or PyPI artifacts with the same supply-chain controls as the rest of your code, solves both problems at once because both problems are already solved at the package layer. Most teams are not going to land on this until the audit cycle forces them to. That is fine. The teams that figure it out before the audit are the teams that get to keep their velocity through Q3.
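A rough sketch of the governance half of the package model, under my own assumptions: the organization keeps a reviewed allow-list of pinned skills (name, version, content hash), and the agent runtime refuses anything that does not match. PinnedSkill, is_allowed, and the "invoice-lookup" skill are hypothetical names for illustration, not part of any real runtime or registry.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PinnedSkill:
    name: str
    version: str
    sha256: str  # digest of the artifact that was actually reviewed


def digest(artifact: bytes) -> str:
    """Content hash of a skill artifact, the same idea as a lockfile integrity hash."""
    return hashlib.sha256(artifact).hexdigest()


def is_allowed(
    name: str, version: str, artifact: bytes, allowlist: Dict[str, PinnedSkill]
) -> bool:
    """A skill loads only if name, version, and content hash all match a reviewed pin."""
    pin = allowlist.get(name)
    return pin is not None and pin.version == version and pin.sha256 == digest(artifact)


# Hypothetical reviewed artifact and the allow-list entry derived from it.
reviewed = b"def run(invoice_id): ..."
allowlist = {
    "invoice-lookup": PinnedSkill("invoice-lookup", "1.2.0", digest(reviewed)),
}

print(is_allowed("invoice-lookup", "1.2.0", reviewed, allowlist))          # exact match
print(is_allowed("invoice-lookup", "1.2.0", reviewed + b"#", allowlist))   # tampered bytes
print(is_allowed("web-scraper", "0.1.0", b"anything", allowlist))          # never reviewed
```

The allow-list here is just a file that can live in version control and go through code review, which is the property an auditor can follow: the record of what agents may use is the same kind of record as the rest of the dependency tree.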
The framework scorecard yesterday was the cross-tool retrospective, and the thing I want to call out from it now that I have a day of distance is that the criteria I used to score the frameworks were the same criteria I would use to score anything else in this stack. Debuggability, lock-in, type safety, observability, and team-shareability are not framework-specific. They are the criteria that should be applied to every AI tooling decision a team makes, including the ones that do not look like framework decisions, like which observability vendor to standardize on or which eval harness to integrate into CI. The reason most teams end up with the AI stack they regret is that they apply rigor to the framework decision and then default to whatever is easiest on every decision after it. The framework is the most visible choice, which is exactly why it is the wrong place to spend most of your judgment budget.
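One way to keep yourself honest about applying the same five criteria everywhere is to write them down as a single type and reuse it for every tooling decision. The sketch below is an illustration only: the 1-to-5 scale, the Scorecard class, and the candidate names and scores are placeholders I made up, not ratings of any real product. I have phrased lock-in as resistance to lock-in so that higher is better on every field.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scorecard:
    # Each criterion is scored 1 (poor) to 5 (strong); the scale is an assumption.
    debuggability: int
    lock_in_resistance: int
    type_safety: int
    observability: int
    team_shareability: int

    def total(self) -> int:
        """Sum of all criteria, for a rough side-by-side comparison."""
        return sum(getattr(self, f.name) for f in fields(self))

    def weakest(self) -> str:
        """Name of the lowest-scoring criterion, the one to probe before committing."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


# Placeholder scores for hypothetical candidates, not ratings of real tools.
candidates = {
    "eval-harness-a": Scorecard(4, 3, 4, 2, 5),
    "observability-vendor-b": Scorecard(3, 2, 3, 5, 4),
}

for name, card in candidates.items():
    print(f"{name}: total={card.total()}, weakest={card.weakest()}")
```

The total matters less than the weakest field: a tool that is fine everywhere except the one axis your production constraint lives on is still the wrong tool.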
The pattern that emerged across all four weeks, which I did not see clearly until I sat down to write this post, is that the AI tools that actually ship are the ones that took some part of the production lifecycle seriously from day one. Strands took observability seriously. Pydantic AI took type contracts seriously. Promptfoo took the test-suite metaphor seriously. Langfuse took self-hosting seriously. Haystack took enterprise procurement seriously. DSPy took the compiler metaphor seriously. None of those tools win on every axis. All of them win on the axis they bet on, and that is the axis that matters when the tool meets a real production constraint. The tools that try to win on every axis at once are the tools you will have stopped hearing about in six months, because the strategy of being slightly above average on everything is the strategy that loses to the strategy of being excellent at the one thing your customer actually feels.
June is going to look different from May for two reasons. The first is that the framework arc has been written, and I do not see new tools worth a spotlight emerging fast enough to fill another thirty days; the field is consolidating, and the spotlights will become rarer and harder to justify. The second is that the open problems I identified across the month (the eval gap, the distribution-versus-governance confusion, the production-hardening lag in newer tools) are not problems that get solved by writing about one more framework. They get solved by writing about the architectural patterns that sit above the framework layer, and that is where I am pointing the calendar for the next month. Expect posts about the shape of an AI stack that survives a regulated-industry deployment, the patterns for running evals in CI without bankrupting the org on inference costs, and a hard look at what changes when the model layer becomes a commodity faster than the tooling layer does.
The month did not change my mind about anything I came in believing. It sharpened the conviction that the interesting work in this field is not happening at the framework layer, and it gave me a stack of examples to point at when someone asks me why. The framework debate is the conversation the vendors want you to be having, because that conversation is good for them. The conversation that is good for you is the one happening at the eval layer, the distribution layer, and the production-hardening layer, and that is the conversation I am here to keep having.
If you read every post this month, thank you. If you read one, thank you for that too. The next thirty days are going to be more opinionated than the last thirty, because the last thirty earned me the right to be.
If this was useful, forward it to one engineer who needs less noise in their feed.


