The Layer Nobody Builds
What Goes Between Your LLM and Your Users
The first line of enterprise defense is not a model. It is a gateway layer that most teams skip until the security review forces them to build it.
Every enterprise AI deployment I have reviewed starts the same way. A team demoes a prototype. The prototype impresses leadership. Someone authorizes a production push. And then the security team asks the question nobody prepared for: “What happens when someone sends a bad input to your model?”
The answer, more often than not, is nothing. The model endpoint is wide open. No rate limiting. No content filtering. No PII redaction. No cost attribution. No audit trail. The API key that the prototype embeds in a client-side config file is the same key that authenticates against the production endpoint. The team built the model integration. They did not build the layer that protects it.
That layer is an AI gateway, and it is the most important piece of infrastructure your enterprise AI deployment is probably missing.
I do not mean a reverse proxy that happens to route to a model provider. I mean a dedicated middleware layer that handles the eight things every production AI system needs and almost no prototype includes. Rate limiting per user and per tenant so one team cannot burn the budget for the whole organization. Content safety filtering on both input and output so the model neither receives nor returns something it should not. PII redaction before data reaches the model provider, because sending PHI or PII to a third-party inference endpoint without contractual data protection is a compliance violation waiting to be found. Prompt injection detection because the most dangerous input to an LLM looks like a normal user message to every other layer of your stack. Cost tracking and budget enforcement so you know who spent what before the invoice arrives. Credential management and virtual keys so you can rotate, revoke, and audit access per application. Request and response logging with full traceability for incident investigation. Failover routing across model providers so a provider outage does not take down your application.
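To make the first of those eight items concrete, here is a minimal sketch of rate limiting at two levels, per user and per tenant, using token buckets. The class names, the in-memory dictionaries, and the limit values are my assumptions for illustration, not how any particular gateway implements it. A real deployment would keep these counters in a shared store so every gateway replica sees the same state.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now


class TwoLevelLimiter:
    """Hypothetical limiter: a request must fit both the user and the tenant bucket."""

    def __init__(
        self,
        user_limit: Tuple[float, float],    # (capacity, refill_per_sec) per user
        tenant_limit: Tuple[float, float],  # (capacity, refill_per_sec) per tenant
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.user_limit = user_limit
        self.tenant_limit = tenant_limit
        self.clock = clock
        self.users: Dict[Tuple[str, str], TokenBucket] = {}
        self.tenants: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, user_id: str) -> bool:
        now = self.clock()
        user = self.users.setdefault(
            (tenant_id, user_id), TokenBucket(*self.user_limit, now)
        )
        tenant = self.tenants.setdefault(tenant_id, TokenBucket(*self.tenant_limit, now))
        user.refill(now)
        tenant.refill(now)
        # Check both buckets before spending from either, so a request rejected
        # at the tenant level does not also cost the user a token.
        if user.tokens < 1 or tenant.tokens < 1:
            return False
        user.tokens -= 1
        tenant.tokens -= 1
        return True
```

The tenant bucket is the part the incident-driven version usually leaves out. With only a per-user limit, a tenant with enough users can still exhaust the shared budget.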
These are not nice-to-haves. They are the difference between a demo that impresses and a system that passes a security review. Every team I have watched skip this layer has eventually built it under duress during an incident. The duress version is always more expensive and less complete.
The failure mode is always the same. Someone internal discovers they can make the model produce something it should not by carefully phrasing a prompt that bypasses the basic instruction-following guardrail. Or a cost spike appears on a credit card bill and nobody knows which team or which user or which application caused it. Or a compliance audit exposes that prompts containing PHI were sent to a third-party API endpoint with no contractual protections for the data in transit. Each of these events triggers a scramble to build the gateway layer that should have been there from day one. And each scramble produces something worse than a proper gateway.
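The cost-spike scenario is the easiest of the three to prevent, because attribution is just bookkeeping done at the right place. A rough sketch follows, assuming the gateway sees token counts for every call. The prices, budget figures, and names such as SpendLedger are placeholders I made up for the example, not real provider rates or a real library API.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Usage:
    tenant: str
    user: str
    app: str
    input_tokens: int
    output_tokens: int


class SpendLedger:
    """Hypothetical ledger: attributes spend to (tenant, user, app) and checks budgets."""

    def __init__(
        self,
        input_price_per_1k: Decimal,   # placeholder price, not a real provider rate
        output_price_per_1k: Decimal,  # placeholder price, not a real provider rate
        tenant_budgets: Dict[str, Decimal],
    ) -> None:
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k
        self.budgets = tenant_budgets
        self.spend: Dict[Tuple[str, str, str], Decimal] = defaultdict(Decimal)

    def cost(self, usage: Usage) -> Decimal:
        total = (
            Decimal(usage.input_tokens) * self.input_price
            + Decimal(usage.output_tokens) * self.output_price
        )
        return total / Decimal(1000)

    def record(self, usage: Usage) -> Decimal:
        """Add the cost of one call to the ledger and return it."""
        cost = self.cost(usage)
        self.spend[(usage.tenant, usage.user, usage.app)] += cost
        return cost

    def tenant_total(self, tenant: str) -> Decimal:
        return sum(
            (amount for (t, _, _), amount in self.spend.items() if t == tenant),
            Decimal(0),
        )

    def within_budget(self, tenant: str) -> bool:
        """True if the tenant has no budget set or has not yet reached it."""
        budget: Optional[Decimal] = self.budgets.get(tenant)
        return budget is None or self.tenant_total(tenant) < budget
```

A gateway would call within_budget before forwarding a request and record after the response comes back. The ledger key is what answers the "which team, which user, which application" question when the bill arrives.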
What comes out of the scramble is almost always a custom middleware script that handles one or two of the eight items. A rate limiter wedged into the application code, parameterized by user ID from the session token but with no tenant-level fallback. PII detection bolted on as a regex pass in the orchestration layer, catching Social Security numbers but missing medical record numbers. Logging that goes to a separate system with no correlation between the prompt and the response, so when someone asks what a user sent and what the model returned, the answer requires manual cross-referencing across two data stores. The team builds exactly the parts that failed in the incident and leaves the remaining six or seven gaps unaddressed. The next incident will involve one of them.
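Two of those gaps are cheap to close when redaction and logging live in the same place. The sketch below redacts both Social Security numbers and medical record numbers, then writes one log record that carries the prompt and the response under a shared request ID. The MRN pattern is an assumption: formats vary by institution, so treat it as a stand-in. The function names and the list-backed log are also hypothetical.

```python
import re
import uuid
from typing import Callable, Dict, List

# US Social Security number in the common 3-2-4 dashed form.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Assumed MRN format: the label "MRN" followed by 6 to 10 digits.
# Real formats differ by institution; adjust the pattern to match yours.
MRN_RE = re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace SSNs and MRNs with placeholder tags."""
    text = SSN_RE.sub("[SSN]", text)
    return MRN_RE.sub("[MRN]", text)


def logged_call(
    prompt: str,
    model: Callable[[str], str],
    log: List[Dict[str, str]],
) -> str:
    """Redact the prompt, call the model, and log both sides under one request ID."""
    request_id = str(uuid.uuid4())
    safe_prompt = redact(prompt)
    safe_response = redact(model(safe_prompt))
    # One record holds both prompt and response, so nobody has to
    # cross-reference two data stores during an investigation.
    log.append(
        {"request_id": request_id, "prompt": safe_prompt, "response": safe_response}
    )
    return safe_response
```

Regex redaction is a floor, not a ceiling. It will miss identifiers that do not follow a fixed pattern, which is one reason to do it in a shared layer where the rules can improve in one place.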
This pattern does not discriminate by organizational maturity. I have seen it in startups with ten engineers and I have seen it in enterprises with dedicated ML platforms and AI infrastructure teams. The gateway is a blind spot because it falls between organizational silos. The ML team owns the model and its behavior. The platform team owns the infrastructure it runs on. The security team owns the policies that govern data handling and access. Nobody owns the layer that enforces the security policies against the model calls at runtime. The gap is organizational, not technical.
The tools exist. They are not the bottleneck. LiteLLM is one of the most widely deployed open-source gateways, with more than one hundred provider integrations, built-in rate limiting, spend tracking, and virtual keys. Portkey combines gateway and observability into one self-hostable platform with SOC 2 compliance, cost tracking per user and per model, and prompt management. NeMo Guardrails from NVIDIA brings content safety, topical boundaries, and factual consistency through a conversational dialog scripting language called Colang. Each takes a different approach to the same problem, and all of them are production-grade. The one you choose matters far less than the fact that you choose one at all.
The organizational pattern is harder than the tool selection. Someone needs to own the gateway layer in the same way someone owns the API gateway. It needs to be shared infrastructure that every AI-consuming application routes through, not a separate implementation deployed independently by each application team. It needs to be configured by central policy and maintained by platform engineering. The moment you let each team build their own gateway wrapper is the moment you lose the audit trail, the cost attribution, and the consistent security posture that made you want a gateway in the first place.
The right time to introduce this layer is before the first production endpoint goes live. Not during the security review six weeks later when someone in a shared calendar invite asks the question that kills the deployment timeline. The cost of retrofitting a gateway is not in the tool setup. It is in the application migration. Once authentication, logging, and rate limiting are baked into application code, extracting them into a shared layer becomes a refactor that nobody has the time or budget for. The applications stay pointed directly at the model provider, and the gateway never gets adopted.
The cost of not having it is worse. Every day your model endpoint is exposed without a gateway is a day someone can discover it, probe it, and send a prompt that makes it past your application-level guardrails. The attack surface for LLM applications is not the model’s training data. It is the input pipeline that feeds the model, and that pipeline has no security if it has no gateway. Prompt injection does not exploit a vulnerability in the model weights. It exploits the absence of a detection layer between the untrusted input and the model’s instruction-following mechanism. That detection layer lives in the gateway or it does not exist at all.
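To show where that detection layer sits, here is a deliberately naive sketch: a pattern screen that runs on untrusted input before it reaches the model. The patterns and names are my own illustrations, and a keyword screen like this is easy to evade. Production detectors typically rely on trained classifiers or dedicated guardrail tooling. The point of the example is the placement, not the patterns.

```python
import re
from typing import List, NamedTuple, Pattern, Tuple

# Illustrative patterns only. A real detector would not rely on a short
# hand-written list like this one.
SUSPICIOUS_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    (
        "override_instructions",
        re.compile(
            r"\b(ignore|disregard|forget)\b.{0,40}"
            r"\b(previous|prior|above|earlier)\b.{0,40}"
            r"\b(instructions?|rules?|prompts?)\b",
            re.IGNORECASE | re.DOTALL,
        ),
    ),
    (
        "reveal_system_prompt",
        re.compile(
            r"\b(reveal|print|show|repeat)\b.{0,40}\bsystem prompt\b",
            re.IGNORECASE | re.DOTALL,
        ),
    ),
    ("role_hijack", re.compile(r"\byou are now\b", re.IGNORECASE)),
]


class Verdict(NamedTuple):
    allowed: bool
    matched: List[str]


def screen_input(text: str) -> Verdict:
    """Return which suspicious patterns matched; block if any did."""
    matched = [name for name, pattern in SUSPICIOUS_PATTERNS if pattern.search(text)]
    return Verdict(allowed=not matched, matched=matched)
```

Because the screen runs in the gateway, every application behind it gets the same check, and every blocked request can be logged with the rule that fired.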
Over the next six days, I am going to walk through the tools that fill this layer and the architecture that connects them. LiteLLM for the gateway itself, covering rate limiting, virtual keys, and the provider routing that keeps applications running through an outage. NeMo Guardrails for content safety and topical boundaries, because rate limits alone do not prevent a model from producing something the legal team will need to explain. Portkey for the teams that want gateway and observability in one deployment, with the tradeoff of vendor consolidation against best-of-breed flexibility. A dedicated look at prompt injection detection as the specific security gap that most enterprise teams have not modeled in their threat surface. And a composable reference architecture that puts the full stack together, from LiteLLM at the edge through NeMo Guardrails in the middle to the inference infrastructure at the back, with PII redaction, cost attribution, and audit logging wired through each layer.
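As a preview of the provider routing piece, failover reduces to trying providers in priority order and moving on when one fails. This is a generic sketch with hypothetical names, not LiteLLM's API. Each provider here is just a callable that takes a prompt and returns text.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair; the callable is a stand-in for a real client.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has raised an error."""


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Simplification: a real gateway would fail over only on transient
            # errors such as timeouts, not on every exception.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the response matters for the audit trail, since the log should record which provider actually served each request.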
The argument for the week, and for the month, is simple. The gateway is not optional. It is not an enterprise requirement that applies only to regulated industries. It is the layer that turns a prototype into something that survives contact with security, compliance, and real users at scale. If it is not in your architecture yet, the month you spend retrofitting it is going to cost more than the week it would have taken to include it at the start. That is not speculation. I have seen the retrofitting cost more than a dozen times, and I have never once seen a team regret building the gateway first.