Your RAG Pipeline Is an Agent Now, Whether You Know It or Not
An engineer walked me through his “RAG pipeline” last week and, by the third minute, had described query rewriting, intent classification, multi-hop retrieval across two vector stores, a Cohere reranker, a tool call into Salesforce for live opportunity data, conditional routing between summarization and direct-answer modes, and a retry loop when the structured-output validator rejected the model’s response. Then he said the latency was up and asked if I had ideas on tuning the chunker.
The chunker is not the problem. The label is.
Whatever this thing is, it stopped being a retrieval-augmented generation pipeline a while ago. The 2024 RAG playbook had four steps: embed the query, retrieve top-k, stuff context into the prompt, generate. That is not a description of any production system I have reviewed this year. The diagrams still say RAG. The runbooks still say RAG. The JIRA tickets still say RAG. The runtime behavior is something else entirely.
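For reference, the four-step baseline fits in a few dozen lines. The sketch below is a toy, not anyone's production code: the corpus, the bag-of-words `embed`, and the placeholder `generate` are all assumptions standing in for a real embedding model, a vector index, and an LLM call.

```python
import math
from collections import Counter

# Hypothetical corpus standing in for an indexed document store.
DOCS = [
    "Refunds are processed within five business days.",
    "Enterprise plans include single sign-on and audit logs.",
    "The API rate limit is 100 requests per minute.",
]


def embed(text: str) -> Counter:
    """Step 1 stand-in: bag-of-words counts instead of a real embedding model."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list:
    """Step 2: rank every document against the query and keep the top k."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list) -> str:
    """Step 3: stuff the retrieved chunks into the prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Step 4 placeholder: a real system would call a model here."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def answer(query: str, k: int = 2) -> str:
    # One straight line, no branches, no loops, no tools.
    return generate(build_prompt(query, retrieve(query, k)))
```

Note what is missing: there is no state carried between calls, no decision point, and nothing that can loop. Every feature listed in the next paragraphs adds one of those.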
The line between RAG and agents was always thinner than people pretended. Retrieval is a tool call. Reranking is a control-flow decision. Conditional routing through different prompts is a loop with branching. Each of those moves, in isolation, was a small and sensible improvement on a brittle baseline. Stacked together over eighteen months, they add up to a non-deterministic loop with state, tool access, and conditional control flow. That is the working definition of an agent in every framework that ships one.
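To make "a loop with state, tool access, and conditional control flow" concrete, here is a minimal sketch. Everything in it is hypothetical: the `State` fields, the two tools, and the rule-based `decide` function, which stands in for whatever classifier or model call makes the routing decision in a real system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class State:
    """State carried across iterations of the loop."""
    question: str
    context: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    answer: Optional[str] = None


def search_docs(state: State) -> None:
    # Hypothetical retrieval tool; a real one would query a vector store.
    state.context.append("doc: Q3 pipeline review notes")


def crm_lookup(state: State) -> None:
    # Hypothetical live-data tool; a real one would call a CRM API.
    state.context.append("crm: 12 open opportunities")


TOOLS = {"search_docs": search_docs, "crm_lookup": crm_lookup}


def decide(state: State) -> str:
    """Stand-in for the routing decision (rules here, a model in practice)."""
    if "search_docs" not in state.steps:
        return "search_docs"
    if "open opportunities" in state.question and "crm_lookup" not in state.steps:
        return "crm_lookup"
    return "answer"


def run(question: str, max_steps: int = 5) -> State:
    state = State(question)
    for _ in range(max_steps):          # the loop
        action = decide(state)          # conditional control flow
        state.steps.append(action)      # state
        if action == "answer":
            state.answer = f"Answer built from {len(state.context)} context items"
            break
        TOOLS[action](state)            # tool access
    return state
```

Two questions take two different paths through the same code: `run("What is our refund policy?")` stops after one tool call, while a question about open opportunities adds the CRM lookup. Once the path depends on the input, you are tracing a trajectory, not a pipeline.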
What practitioners actually built, while telling themselves they were “improving the RAG pipeline,” reads like a tour of the agent playbook. Query decomposition, because single-shot retrieval missed compound questions. Reranking with Cohere or BGE, because semantic similarity alone returned topical-but-wrong context. Multi-hop retrieval, because the answer required chaining facts across documents. Tool calls into Postgres or Snowflake, because the customer wanted live numbers and not a vector of last quarter’s PDF. Conditional branching, because some questions needed retrieval and some needed a calculator. Retry logic with structured-output validation, because the model returned malformed JSON one time in fifty and oncall got tired of paging. Together, they ship as an agent.
The honest pushback here is that this is semantics. Call it a pipeline, call it an agent, the thing still answers customer questions and the chunker still needs tuning. That argument would hold if it were not for what the framework maintainers did next. Haystack 2.x rewrote its component model so tool integration is first-class, and the Pipeline type now does what a runtime control-flow graph does. LlamaIndex Workflows added explicit state, event-driven steps, and conditional routing, then started shipping templates labeled “agent” in the docs. Neither project pivoted. They grew into the shape their users were already forcing on them. The maintainers watched the issues and PRs roll in, and the bulk of them were about multi-step, stateful, tool-using flows. The label changed because the work changed.
The mislabel matters because the operational story you wrote for a deterministic retrieval pipeline does not survive contact with a non-deterministic agent loop. Tracing a RAG pipeline means logging the query, the retrieved chunks, and the prompt. Tracing an agent means logging the trajectory: every step, every tool call, every branch decision, every retry. Eval-gating a RAG pipeline means scoring retrieval precision and answer faithfulness against a golden set. For an agent, you score trajectories instead of endpoints, because the same correct answer can come from a clean path or from a disaster path that happened to converge. Incident response on a RAG pipeline starts at the embedder and the index. On an agent, you ask which tool call failed, which branch fired, and which retry consumed the budget. The runbooks are not interchangeable.
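The difference between scoring endpoints and scoring trajectories is easy to show in miniature. The sketch below assumes a hypothetical `Run` record of recorded steps, and the step and failure budgets are illustrative numbers, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str   # e.g. "retrieve", "tool", "validate"
    ok: bool    # did this step succeed on this attempt?


@dataclass
class Run:
    steps: list
    answer: str


def endpoint_score(run: Run, expected: str) -> bool:
    """RAG-style check: only the final answer is inspected."""
    return run.answer == expected


def trajectory_score(run: Run, expected: str,
                     max_steps: int = 4, max_failures: int = 0) -> dict:
    """Agent-style check: the path has to be acceptable too."""
    failures = sum(1 for s in run.steps if not s.ok)
    return {
        "answer_correct": run.answer == expected,
        "within_step_budget": len(run.steps) <= max_steps,
        "within_failure_budget": failures <= max_failures,
    }


def passes(run: Run, expected: str) -> bool:
    return all(trajectory_score(run, expected).values())


# Two hypothetical runs that end in the same correct answer.
good = Run(
    [Step("retrieve", True), Step("tool", True), Step("validate", True)],
    "42 open deals",
)
lucky = Run(
    [Step("retrieve", True), Step("tool", False), Step("tool", False),
     Step("tool", True), Step("validate", False), Step("validate", True)],
    "42 open deals",
)
```

Both runs pass `endpoint_score`. Only `good` passes the trajectory check; `lucky` burned three failed attempts and blew the step budget on the way to the same string.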
The cost of the mislabel shows up on a Tuesday afternoon, when oncall posts “the RAG pipeline is broken” in the incident channel and starts checking the vector store. Two hours in, they find the actual problem: a tool call to Salesforce hit a rate limit, the retry loop ate the timeout budget, and the model produced a confident answer from stale context because the fallback path never surfaced the upstream failure. None of that lives in the RAG runbook, because on paper this is a RAG system. The team learns the wrong lesson, files the wrong ticket, and the next incident has the same shape because nobody updated the mental model. That is what the label costs you.
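That failure chain has a small structural fix: give the retry loop an explicit deadline, and make the fallback carry the failure forward instead of swallowing it. A minimal sketch, assuming a hypothetical `ToolError` raised by the tool wrapper and a `ToolResult` shape of my own invention:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class ToolError(Exception):
    """Hypothetical error raised by a tool wrapper (rate limit, timeout, ...)."""


@dataclass
class ToolResult:
    value: Optional[str]
    degraded: bool              # True when the tool never succeeded
    reason: Optional[str] = None
    attempts: int = 0


def call_with_budget(tool: Callable[[], str], budget_s: float,
                     max_attempts: int = 3, backoff_s: float = 0.01) -> ToolResult:
    """Retry a tool call, but never past a wall-clock deadline."""
    deadline = time.monotonic() + budget_s
    attempts = 0
    last_error = "no attempt made"
    while attempts < max_attempts and time.monotonic() < deadline:
        attempts += 1
        try:
            return ToolResult(tool(), degraded=False, attempts=attempts)
        except ToolError as exc:
            last_error = str(exc)
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            # Linear backoff, capped so the sleep cannot overrun the deadline.
            time.sleep(min(backoff_s * attempts, remaining))
    # The failure is returned as data, so downstream steps can see it.
    return ToolResult(None, degraded=True, reason=last_error, attempts=attempts)


def respond(result: ToolResult) -> str:
    """Fallback path that reflects the upstream failure instead of hiding it."""
    if result.degraded:
        return f"Live data unavailable ({result.reason}); not answering from stale context."
    return f"Live data: {result.value}"
```

The design choice that matters is the `degraded` flag. A fallback that returns a plain string gives the model nothing to distinguish fresh context from stale, and gives oncall nothing to search for in the trace.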
The upgrade path here is not what people assume it is. Nobody needs to migrate to LangGraph, swap Haystack for CrewAI, or rebuild on Pydantic AI to fix this. The framework you have is almost certainly fine. The Haystack 2.x, LlamaIndex Workflows, or Semantic Kernel pipeline you already run is a reasonable agent runtime. The upgrade is admitting what you have been running, then giving it the operational treatment an agent deserves. Trace every step end to end with OpenTelemetry into Langfuse or Phoenix. Name your loops in code so the trace UI shows you a graph instead of a flat call list. Eval-gate changes with promptfoo or Braintrust against trajectory-level test cases, not single-shot Q&A pairs. Write the failure-mode catalog: what happens when the reranker times out, when the SQL tool errors, when the validator rejects, when the user asks something out of scope. Own the loops you already wrote.
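"Name your loops" is the cheapest item on that list, so here is what it looks like. The sketch uses a tiny stdlib-only `Tracer` of my own as a stand-in so it runs anywhere; in a real system the same nesting would come from OpenTelemetry spans (for example `tracer.start_as_current_span(...)`) exported to your trace backend. The span names and the `handle` function are hypothetical.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class Tracer:
    """Stand-in tracer: nested `with` blocks become a tree of named spans."""

    def __init__(self) -> None:
        self.root = Span("request")
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str, **attrs):
        node = Span(name, dict(attrs))
        self._stack[-1].children.append(node)   # attach under the current span
        self._stack.append(node)
        try:
            yield node
        except Exception as exc:
            node.attrs["error"] = repr(exc)      # record the failure on the span
            raise
        finally:
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list:
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def handle(tracer: Tracer, validator_failures: int = 1) -> None:
    with tracer.span("retrieve"):
        pass  # retrieval would happen here
    # The loop gets its own name, and each iteration is a child span.
    with tracer.span("validation_retry_loop") as loop:
        attempt = 0
        while True:
            attempt += 1
            with tracer.span(f"attempt_{attempt}") as step:
                step.attrs["valid"] = attempt > validator_failures
                if step.attrs["valid"]:
                    break
        loop.attrs["attempts"] = attempt
```

Rendering the tree after `handle(tracer)` shows `validation_retry_loop` as a parent with one child per attempt, rather than a flat list of model calls whose relationship you have to reconstruct during an incident.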
Once you admit you have been building an agent, the architectural questions get easier. The “which framework should we adopt” debate stops mattering, because the framework you have already does the job and the migration cost is real. The “is our system an agent” debate stops mattering, because the answer is yes and the question is how to operate it. What remains is the only question that was ever interesting: what capabilities does your system have, and which ones do you give it next?
That is the question worth your time. The skills arc starts there.


