Comet Opik
The Open-Source Eval Platform Shipping Multiple Releases a Week
Tracing, evaluation, and optimization in one Apache-2.0 stack. No SaaS lock-in. Releases you can set your watch by.
Most LLM evaluation platforms ship quarterly. Opik ships multiple times a week. I am not exaggerating for effect. Between June 9 and June 17 of this year, the GitHub releases page logged ten versions: 2.0.59 through 2.0.68. That is a release roughly every twenty hours across eight days. The version number alone, 2.0.68 as of this writing, tells you something about the maintenance philosophy. This is not a project that ships one big release and goes quiet for a quarter while the backlog piles up. This is a project where the commit history and the release feed are the same shape. For an eval platform, where stale metric implementations produce false confidence, that cadence is not a nice-to-have. It is the difference between knowing your production behavior and guessing at it.
Opik comes from Comet, the ML experiment tracking company that has been shipping monitoring tooling since before the current LLM wave had a name. The branding overlap is both a strength and a visibility tax. Comet’s name recognition in classical ML circles is solid. The company has been tracking experiments, logging metrics, and managing model registries for years. But the LLM-specific product, Opik, lives in the shadow of that history. People who know Comet for TensorBoard-style experiment tracking do not immediately connect it to a tracing-plus-evaluation platform for Claude and GPT-4o. That disconnect is unfortunate. The LLM product is substantially more capable than the word-of-mouth footprint suggests.
The architecture is a self-hosted server plus a client SDK. The server runs on Docker Compose for development or Kubernetes with a Helm chart for production. The SDK is Python-first with TypeScript and Ruby support through OpenTelemetry. Install the Python package, run opik configure to point it at your server or Comet’s cloud, and every LLM call, tool invocation, and agent step gets logged as a trace with span-level detail. The integration list is extensive. LangChain, LlamaIndex, CrewAI, Autogen, Google ADK, Flowise AI, and Anthropic’s native SDK all get first-class trace logging with a single import or callback. The design assumption is that you are already using one of these frameworks and Opik should not make you change your code to get observability. That assumption is correct.
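To make "a trace with span-level detail" concrete, here is a minimal standard-library sketch of the idea. It is not Opik's SDK: the traced decorator, the Span class, and the TRACES list are hypothetical names chosen for illustration, and the retrieval and LLM calls are stand-ins. Opik's own decorators and framework callbacks are described in its documentation.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Span:
    """One recorded step: a function call with its inputs, output, and timing."""
    name: str
    inputs: dict
    output: Any = None
    duration_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


# Hypothetical in-memory store; a real platform would ship spans to a server.
TRACES: List[Span] = []
_stack: List[Span] = []


def traced(fn: Callable) -> Callable:
    """Record each call as a span, nested under whichever span is active."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = Span(name=fn.__name__, inputs={"args": args, "kwargs": kwargs})
        # Attach to the active parent span, or start a new top-level trace.
        (_stack[-1].children if _stack else TRACES).append(span)
        _stack.append(span)
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        finally:
            span.duration_s = time.perf_counter() - start
            _stack.pop()
    return wrapper


@traced
def retrieve(question: str) -> List[str]:
    # Stand-in for a retrieval tool call.
    return ["Opik is licensed under Apache 2.0."]


@traced
def answer(question: str) -> str:
    context = retrieve(question)
    # Stand-in for an LLM call; no model is invoked here.
    return f"Based on {len(context)} document(s): {context[0]}"


answer("What license does Opik use?")
```

After the call, TRACES holds one top-level span for answer with a child span for retrieve, which is the parent-child shape a trace viewer renders.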
The evaluation surface is where Opik distinguishes itself from pure observability tools. The platform ships a suite of LLM-as-a-judge metrics: hallucination detection, moderation scoring, answer relevance, context precision, and context recall. These are the same metrics you find in dedicated evaluation libraries. The difference is that Opik runs them against the same trace data it collects during normal operation. You do not export your traces to a separate evaluation pipeline. You define an experiment against a dataset, point it at the traces already in the system, and get scored results back in the same UI that shows your production monitoring dashboards. For a team that is currently running zero evals because the tooling friction is too high, that unification is the difference between a plan and a habit.
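As a rough illustration of scoring traces in place, the sketch below runs a metric over records that are already logged and aggregates the result. Everything here is an assumption made for the example: the trace fields, the run_experiment helper, and especially toy_groundedness, which is a crude word-overlap heuristic standing in for an LLM-as-a-judge call. It does not reproduce Opik's metric implementations.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Trace = Dict[str, str]

# Hypothetical logged traces: question, retrieved context, and model output.
traces: List[Trace] = [
    {"input": "What license is Opik under?",
     "context": "Opik is released under Apache 2.0.",
     "output": "Opik is released under Apache 2.0."},
    {"input": "Who builds Opik?",
     "context": "Opik is built by Comet.",
     "output": "Opik is built by a university lab."},
]


def toy_groundedness(trace: Trace) -> float:
    """Stand-in for an LLM judge: fraction of output words found in the context."""
    context_words = set(trace["context"].lower().split())
    output_words = trace["output"].lower().split()
    return sum(w in context_words for w in output_words) / len(output_words)


def run_experiment(
    items: List[Trace],
    metrics: Dict[str, Callable[[Trace], float]],
) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Score every trace with every metric and report per-metric averages."""
    rows = [{name: fn(t) for name, fn in metrics.items()} for t in items]
    summary = {name: mean(r[name] for r in rows) for name in metrics}
    return rows, summary


rows, summary = run_experiment(traces, {"groundedness": toy_groundedness})
```

The second trace scores lower because part of its output has no support in the retrieved context, which is the kind of signal a hallucination metric is meant to surface.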
The dataset management is genuinely useful in a way that most eval platform dataset features are not. You can create datasets from production traces, which means your eval data is drawn from real user interactions rather than synthetic prompts you hope are representative. You can annotate spans with feedback scores through the SDK or the UI, which means your domain experts can label outputs without learning a new annotation tool. And you can run experiments that compare different prompts, models, or configurations against the same dataset with side-by-side results. The flow is: log traces from production, promote the interesting ones to a dataset, annotate them, and run experiments against that dataset whenever you change a prompt or swap a model. That cycle, repeated weekly, is what continuous evaluation actually looks like in practice. Most teams never get past step one because the tooling does not make the loop obvious.
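The four-step loop described above can be sketched with plain data structures. This is an illustration under stated assumptions, not Opik's dataset or experiment API: the trace records, the flagged field, the annotations mapping, and the two variant functions are all hypothetical, and exact-match scoring stands in for whatever metric a team actually uses.

```python
from typing import Callable, Dict, List

# Step 1: hypothetical production traces (in practice these come from logging).
production_traces = [
    {"id": "t1", "input": "Reset my password", "output": "Go to Settings > Security.", "flagged": False},
    {"id": "t2", "input": "Cancel my plan", "output": "I cannot help with that.", "flagged": True},
    {"id": "t3", "input": "Export my data", "output": "Sorry, I do not know.", "flagged": True},
]

# Step 2: promote the interesting traces (here, the flagged ones) to a dataset.
dataset = [{"id": t["id"], "input": t["input"]} for t in production_traces if t["flagged"]]

# Step 3: a domain expert annotates each item with the expected answer.
annotations = {"t2": "Billing > Cancel plan", "t3": "Settings > Export"}
for item in dataset:
    item["expected"] = annotations[item["id"]]


# Step 4: run the same dataset through two configurations and compare.
def variant_a(text: str) -> str:
    # Stand-in for the current prompt, which refuses everything.
    return "I cannot help with that."


def variant_b(text: str) -> str:
    # Stand-in for a revised prompt that knows the navigation paths.
    lookup = {"Cancel my plan": "Billing > Cancel plan", "Export my data": "Settings > Export"}
    return lookup.get(text, "I do not know.")


def compare(items: List[dict], variants: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Return one row per dataset item with an exact-match score per variant."""
    return [
        {"id": it["id"],
         **{name: float(fn(it["input"]) == it["expected"]) for name, fn in variants.items()}}
        for it in items
    ]


results = compare(dataset, {"prompt_a": variant_a, "prompt_b": variant_b})
```

Each row in results puts both variants next to each other for the same item, which is the side-by-side view the loop depends on.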
The pytest integration is the bridge between the evaluation platform and CI. Opik lets you define evaluation tests as pytest functions, which means your hallucination checks run alongside your unit tests on every push. The same metrics that power the dashboard evaluations (hallucination, answer relevance, context precision) are available as test assertions. This is the pattern DeepEval popularized, and Opik’s implementation is clean. The value proposition is slightly different from DeepEval’s, though. DeepEval is a test framework that happens to have trace visualization. Opik is a trace-plus-evaluation platform that happens to have a pytest plugin. If your primary need is CI integration with minimal infrastructure, DeepEval is the leaner pick. If your primary need is a unified observability-plus-evaluation surface with CI as one use case among several, Opik covers more ground. The frameworks are not redundant. They are different answers to the question of where evaluation lives in your stack.
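The general shape of an evaluation written as a pytest test looks something like the sketch below. It deliberately avoids Opik's and DeepEval's actual plugin APIs: toy_hallucination_score and generate_answer are hypothetical stand-ins, and the 0.1 threshold is an arbitrary assumption. In a real suite the metric would be an LLM-judged score and the function under test would call your application.

```python
from typing import List


def toy_hallucination_score(output: str, context: List[str]) -> float:
    """Hypothetical stand-in for an LLM-judged metric.

    Returns the fraction of output sentences that do not appear in the context,
    so 0.0 means fully supported and 1.0 means fully unsupported.
    """
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    joined = " ".join(context)
    unsupported = [s for s in sentences if s not in joined]
    return len(unsupported) / len(sentences)


def generate_answer(question: str, context: List[str]) -> str:
    # Stand-in for the application under test; a real test would call the LLM app.
    return context[0]


def test_answer_is_grounded_in_context():
    context = ["Opik is licensed under the Apache license."]
    output = generate_answer("What license does Opik use?", context)
    # Fail the build if more than 10% of the answer is unsupported.
    assert toy_hallucination_score(output, context) <= 0.1
```

Saved in a file such as test_grounding.py (a hypothetical name), this is collected by pytest like any other test, so a regression in groundedness fails the push the same way a broken unit test would.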
The Opik Agent Optimizer is the feature that points toward where the platform is going. It is a dedicated SDK that takes your logged traces and automatically optimizes prompts and tool configurations using DSPy-style techniques. The optimizers run against your actual production data, not synthetic benchmarks, which means the improvements are grounded in the behavior your users actually see. This is the logical endpoint of the unified platform thesis: if you are already collecting traces, running evaluations, and managing datasets, the optimizer closes the loop by feeding evaluation results back into prompt improvements automatically. It is still an early feature. The documentation is thinner than the core tracing and eval docs, and the optimizer catalog is smaller than what dedicated optimization frameworks provide. But the architecture is right. The data the optimizer needs already lives in the platform. The missing piece is breadth, not approach.
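The closed loop is easier to picture with a toy. The sketch below is a plain search over candidate prompts scored against a dataset. It is not the Opik Agent Optimizer's algorithm and not a DSPy technique: fake_model, the candidate prompts, and the dataset are all invented for illustration, and a real optimizer would generate and refine candidates rather than pick from a fixed list.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical dataset of (input, expected output) pairs promoted from traces.
dataset: List[Tuple[str, str]] = [("paris", "PARIS"), ("rome", "ROME")]


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for an LLM whose behavior depends on what the prompt asks for."""
    out = text.upper() if "uppercase" in prompt else text
    if "no punctuation" not in prompt:
        out += "."
    return out


def score(prompt: str) -> float:
    """Evaluation step: mean exact-match accuracy of a prompt over the dataset."""
    return mean(float(fake_model(prompt, x) == y) for x, y in dataset)


def optimize(candidates: List[str]) -> Tuple[str, float]:
    """Optimization step: keep the candidate prompt with the best score."""
    best = max(candidates, key=score)
    return best, score(best)


candidates = [
    "Repeat the city.",
    "Repeat the city in uppercase.",
    "Repeat the city in uppercase with no punctuation.",
]
best_prompt, best_score = optimize(candidates)
```

The point of the sketch is the data flow: the same dataset and metric used for evaluation become the objective the optimizer searches against, so no separate export step is needed.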
The self-hosting story deserves attention because it is the thing that determines whether Opik is a tool you adopt or a service you evaluate. Opik is Apache 2.0. Every feature in the open-source release runs on your own infrastructure with no license tier gating. Comet runs a cloud version at comet.com that adds managed hosting, and the cloud signup flow is prominent in the documentation. But the self-hosted path is documented, tested, and actively maintained. The Docker Compose setup boots an instance in minutes. The Kubernetes Helm chart handles production deployments. The containers run as non-root users. The service profiles let you start only the infrastructure components if you want to develop against a minimal stack. This is not a vendor that open-sourced a stripped-down version to drive cloud signups. This is a vendor that ships the full product under Apache 2.0 and monetizes the hosting.
The release cadence is the strongest signal about long-term health. Opik shipped ten releases in the eight days before this article was written. The changelog is not a list of cosmetic fixes. Recent releases added guardrails support, online evaluation rules for production monitoring, service profiles for Docker Compose, and non-root container security. The integration catalog added Google ADK, Autogen, AG2, and Flowise AI in the same window. This is not a project coasting on a 1.0 launch. This is a project where the CI pipeline is running hot and the maintainers are shipping faster than most teams can evaluate the releases. For an eval platform, that is the right problem to have.
The thing to watch is the optimizer gap. The Opik Agent Optimizer is a promising feature with a thin implementation surface. The auto-optimization capability is the differentiator that could make Opik the default platform for teams building agentic systems, but it needs more optimizer types and better documentation before it crosses from “interesting demo” to “production workflow.” The core tracing and evaluation surface is already production-grade. The optimizer is the bet on where the field goes next. If Comet invests heavily in that feature over the next quarter, Opik becomes the only platform that unifies tracing, evaluation, and automated optimization in a single self-hostable stack. If it languishes, Opik remains a solid evaluation platform that competes directly with Langfuse and Lunary on tracing and with DeepEval on pytest integration, but does not break out of the comparison set.
The Comet brand tax is real and probably permanent. Opik will likely always be introduced as “Comet Opik” and people will likely always need to be told that the Comet that tracks ML experiments also builds an LLM evaluation platform. In a market where Langfuse has mindshare and DeepEval has the clever pytest hook, Opik has to work harder for attention. The product earns it. Nineteen thousand GitHub stars say the attention is coming whether or not the branding is efficient. The question is whether the optimizer lands before the next wave of platforms ships their own versions. The trajectory says yes. The release cadence says the team has the velocity to get there. The architecture says the pieces are already in place. The only missing piece is time, and at a release every twenty hours, time is on Opik’s side.


