OpenLLMetry
The Right Answer When You Already Have Datadog
One Traceloop.init() call and your agent traces show up next to your HTTP spans. Your platform team does not need a separate observability stack for AI.
Every AI observability product on the market asks you the same question: how do you want your traces exported? They offer a dropdown of twenty destinations, build first-class integrations for a few hosted backends, and leave the rest as community-contributed collectors that may or may not survive a version upgrade. The question hiding underneath that dropdown is whether you want a second observability stack. Most teams say yes without realizing it. They install an agent tracing SDK, wire it to its own hosted dashboard, and now they have two places to look when something breaks. One for HTTP request latency and database query times. Another for token counts, tool calls, and generation spans. The two dashboards do not talk to each other. When an agent call is slow because the downstream API was slow, the correlation lives in the engineer’s head, not in the tooling. That is not a minor inconvenience. It is the failure mode that makes AI observability a separate category instead of a dimension of existing observability.
OpenLLMetry is the tool that solves that problem by not asking the question. It is an Apache-2.0 suite of OpenTelemetry instrumentations for LLM providers, vector databases, agent frameworks, and MCP servers, all built and maintained by Traceloop, a Y Combinator company that exited stealth in 2023 and now ships new releases weekly. The Python SDK alone has over 7,200 stars on GitHub and is actively maintained by a company whose business depends on the library being good, not on locking you into their proprietary dashboard. The npm version is at parity and ships the same model of standard OpenTelemetry spans to any collector. If your platform team already runs an observability stack, OpenLLMetry is not competing with it. It is feeding it.
The surface area is the widest in the open-source LLM observability landscape. LLM providers include OpenAI, Anthropic, Gemini, Mistral, Groq, Ollama, Together, Bedrock, SageMaker, Vertex AI, Replicate, Hugging Face, Watsonx, Aleph Alpha, and Writer. Vector databases: Chroma, Pinecone, Qdrant, Weaviate, Milvus, LanceDB, Marqo. Agent frameworks include LangChain, LangGraph, LlamaIndex, CrewAI, Haystack, Agno, OpenAI Agents, and AWS Strands. Protocol support: MCP. In total: twenty-three LLM providers, seven vector stores, nine frameworks. Each instrumentation is a standalone OpenTelemetry package under the opentelemetry-instrumentation- namespace, which means you can install only what you need or install the SDK and get all of them. The SDK is a convenience wrapper that bundles the common instrumentations and exposes a single Traceloop.init() call. You install traceloop-sdk from PyPI, call Traceloop.init() at startup, and your agent’s token usage, tool call latency, tool call failures, vector search times, and generation spans all flow into the same OTLP endpoint your HTTP services already use. The span attributes follow the semantic conventions for LLM operations that Traceloop proposed to the OpenTelemetry project and that the project adopted. Your traces are not vendor-specific. They are the standard.
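As a rough sketch of what that startup call can look like, assuming traceloop-sdk is installed and your collector accepts OTLP over HTTP: the app_name, api_endpoint, and headers arguments are Traceloop.init() parameters, while the environment variable names, the default service name, the localhost endpoint, and both helper functions are placeholders invented for this example.

```python
import os


def collector_settings(env):
    """Build keyword arguments for Traceloop.init() from environment-style config.

    AGENT_SERVICE_NAME, AGENT_OTLP_ENDPOINT, and AGENT_OTLP_TOKEN are
    hypothetical names for this sketch; use whatever your platform team
    already standardizes on.
    """
    settings = {
        "app_name": env.get("AGENT_SERVICE_NAME", "support-agent"),
        # Assumed: a local collector listening on the default OTLP/HTTP port.
        "api_endpoint": env.get("AGENT_OTLP_ENDPOINT", "http://localhost:4318"),
    }
    token = env.get("AGENT_OTLP_TOKEN")
    if token:
        # Only send an auth header when a token is actually configured.
        settings["headers"] = {"Authorization": f"Bearer {token}"}
    return settings


def init_tracing(env=os.environ):
    """Point OpenLLMetry at the same collector the HTTP services use."""
    # Imported lazily so this module still loads where traceloop-sdk is absent.
    from traceloop.sdk import Traceloop

    Traceloop.init(**collector_settings(env))
```

Call init_tracing() once at process startup, before the first LLM client is created, so the instrumentations are in place when the first request arrives.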
The practical difference this makes is hard to appreciate until you have spent a morning in Langfuse trying to correlate an agent failure with a downstream API timeout. Langfuse is a good product. I have used it in production. The traces are clean, the evaluations are useful, and the free tier is generous. But when your agent crashes on step 37 and you need to know whether the OpenAI call hung, the Pinecone query returned empty, or the embedding service returned a 503, Langfuse shows you the agent trace with all the LLM spans, and your Datadog dashboard shows you the HTTP status codes and the latency percentiles. The correlation is manual. You look at the trace timestamp, switch tabs, look at the Datadog spike, switch back, try to match the span IDs. It is entirely possible to do and entirely wasteful. With OpenLLMetry, the agent spans land in Datadog or Honeycomb or Grafana or Splunk or any of the twenty-three supported destinations alongside every other span in your system. The agent trace and the HTTP trace are the same trace. The Pinecone latency percentile for this agent’s query is a dimension on the same dashboard as the P99 for the API endpoint that triggered the agent. You do not switch tabs.
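The mechanism behind "the same trace" is W3C Trace Context: the trace ID of the incoming HTTP request travels in a traceparent header, and every child span, including the LLM and vector store spans, inherits it. The stdlib-only sketch below shows that header format. OpenTelemetry SDKs handle this propagation for you; the function names here are illustrative and not part of any library.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Build the header a child span sends downstream.

    The trace ID is carried over unchanged; only the span ID is new.
    """
    parsed = parse_traceparent(header)
    if parsed is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_id, sampled = parsed
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{'01' if sampled else '00'}"
```

Because the agent's spans carry the trace ID of the request that triggered them, the backend can group them into one trace without anyone matching timestamps by hand.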
The 0.61.0 release, which shipped at the end of May, added structured-output tracing for OpenAI’s responses.parse() method and GenAI semantic convention compliance for the OpenAI Agents SDK. Both additions track the direction of the ecosystem. More teams are using structured output for agent tool calls, which means the traces need to capture what was parsed and what was returned. More teams are building with OpenAI Agents, which means the spans need to report the prompt instructions, the tool call latency, the cache read inputs, and the reasoning tokens. OpenLLMetry captures all of that in standard OpenTelemetry span attributes. The v0.61.0 release also fixed a batch of exception recording bugs across LangChain, Anthropic, Groq, Mistral AI, Bedrock, Ollama, SageMaker, Together, Chroma, LanceDB, Weaviate, and Pinecone. When a provider returns a rate-limit error or a vector store query times out, the span now records the exception and sets an ERROR status instead of silently dropping the event. That is the kind of fix that does not make a changelog sound exciting but makes the difference between traces you can debug from and traces you have to guess at.
The vector store instrumentations deserve a separate mention because they solve a problem that most agent teams discover around month three of production. Your retrieval-augmented generation pipeline is the most latency-sensitive part of your agent stack. A query that takes 40 ms in isolation takes 200 ms under load because the vector store connection pool is exhausted. You cannot debug that without traces that include the vector store query time, the embeddings count, and the result count. OpenLLMetry emits all of those as standard span attributes, consistently across Pinecone, Milvus, Qdrant, Weaviate, Chroma, and LanceDB. The 0.61.0 release fixed a bug where the attribute names were inconsistent across providers. Now they are uniform: every query span carries a db.system attribute plus db.vector.dimension_count and db.vector.result_count. You do not need to know which vector store is backing your agent to write a dashboard filter that catches slow queries.
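To make that last point concrete, here is a sketch of such a filter written against exported span data rather than any particular backend's query language. The span dictionaries, span names, and the 150 ms threshold are invented for illustration; the attribute names are the ones cited above.

```python
def slow_vector_queries(spans, threshold_ms=150.0):
    """Return vector store query spans slower than threshold_ms, slowest first.

    Each span is assumed to be a dict with 'name', 'duration_ms', and
    'attributes' keys. That shape is hypothetical, not an exporter format.
    """
    slow = [
        span
        for span in spans
        # The result-count attribute marks a vector query span,
        # regardless of which store produced it.
        if "db.vector.result_count" in span["attributes"]
        and span["duration_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda span: span["duration_ms"], reverse=True)


# Made-up sample data: three different stores and one non-vector span.
sample_spans = [
    {"name": "pinecone.query", "duration_ms": 212.0,
     "attributes": {"db.system": "pinecone", "db.vector.result_count": 5}},
    {"name": "qdrant.search", "duration_ms": 38.5,
     "attributes": {"db.system": "qdrant", "db.vector.result_count": 10}},
    {"name": "chroma.query", "duration_ms": 487.3,
     "attributes": {"db.system": "chroma", "db.vector.result_count": 0}},
    {"name": "POST /chat", "duration_ms": 950.0,
     "attributes": {"http.method": "POST"}},
]

for span in slow_vector_queries(sample_spans):
    print(span["attributes"]["db.system"], span["duration_ms"])
```

The filter never branches on the store name. Swapping Pinecone for Qdrant changes the db.system value on the spans, not the query that finds the slow ones.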
The comparison that matters is not OpenLLMetry versus Langfuse or OpenLLMetry versus Arize or OpenLLMetry versus any of the dedicated LLM observability platforms. Those products are better than OpenLLMetry if you do not already have an observability stack. They have nicer UIs, they have built-in evaluation dashboards, they have dataset management and annotation workflows. If you are starting from scratch and need an observability solution for your AI application, Langfuse is the right answer. But if your organization already runs Datadog or Grafana or Honeycomb or New Relic or Splunk or Azure App Insights or GCP Cloud Trace or any of the other sixteen backends OpenLLMetry supports, the right answer is to feed that stack. Your platform team maintains it. Your on-call engineers are already trained on it. Your dashboards and alerts and SLOs already live there. Adding a second observability stack because the AI team likes the agent trace viewer is the kind of decision that seems reasonable in a sprint planning session and turns into a maintenance burden six months later when nobody remembers which dashboard to check first.
The tradeoff is real. OpenLLMetry’s output is standard OpenTelemetry spans with semantic convention attributes. It does not have an agent trace viewer. It does not have a built-in evaluation UI. It does not have a dataset management interface. The span attributes for evaluations like hallucination detection or answer relevance require the separate Traceloop evaluator package and a custom exporter. If you need the full AI-observability product with the evaluation dashboards and the dataset management, OpenLLMetry alone is not that product. It is the plumbing that makes your existing observability stack understand AI workloads. Most teams need that plumbing more than they need another UI.
Traceloop the company is worth understanding because the sustainability model matters for an open-source tool you are considering for production. Traceloop operates a hosted platform at traceloop.com, the same model that MongoDB, Grafana, and Sentry use. The open-source library is the on-ramp and the integration layer. The commercial product is a hosted OTel backend with AI-specific dashboards, evaluations, and alerts. This is one of the healthier open-source company models. Traceloop wins when you adopt OpenLLMetry and later decide you want their managed backend. You win because the library is genuinely useful without their backend and works with any OTLP-compatible destination. The Apache-2.0 license means every release you have already adopted stays permissively licensed, whatever happens to future versions. The company is incentivized to keep the library excellent because the library is their lead generator. That alignment is more durable than the model where the open-source library is crippled without the paid tier.
The evaluation arc that runs through this month has covered Inspect AI, DeepEval, Comet Opik, and HUD. Each one solves a different piece of the evaluation problem. Inspect AI gives you a government-grade solver-scorer abstraction for multi-step agent tasks. DeepEval gives you pytest-style unit tests for LLM outputs. Comet Opik gives you a full evaluation platform with dataset management and hosted judges. HUD gives you an environment-driven agent evaluation protocol with real shell and browser capabilities. OpenLLMetry does not compete with any of them. It sits underneath them, providing the trace data that makes the evaluations interpretable. You run your agents through HUD, you collect rewards, and OpenLLMetry captures the per-step traces so you can see why the agent with the high reward did what it did. You run DeepEval assertions in CI, and OpenLLMetry captures the evaluation context so you can trace a failing assertion back to the model call that produced the output. The tools are additive. OpenLLMetry is the layer that connects them to the rest of your observability.
If you already maintain an observability stack, the choice is straightforward. Install the one integration that routes your LLM and agent traces into your existing infrastructure. Your on-call engineers will never need to learn a second dashboard.