CocoIndex

dbt for Vector Pipelines

Jun 27, 2026

Incremental data-to-embedding pipelines, declaratively. Only the delta ever gets reprocessed.

Your vector pipeline does the same thing every night. It reads every document from the source, chunks it, embeds it, and writes it to the vector store. If you have ten thousand documents, you process all ten thousand. If one document changed, you process all ten thousand. If you swapped your embedding model, you process all ten thousand. The nightly reindex job that started as a tolerable twenty-minute background task becomes a six-hour grind that occupies a GPU nobody else can use, and you accept it because the alternatives are worse.

CocoIndex is the alternative. It is a Rust-core framework with a Python DSL that treats your vector index the way dbt treats your data warehouse. You declare what the index should look like as a function of the source data, and the engine figures out what to add, update, or delete. Only the delta ever gets processed. On a re-run, a single PDF that changed paths causes exactly that PDF to be re-chunked and re-embedded, not the entire corpus, and a deleted source file causes its index rows to be removed without requiring a full rebuild.
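
To make "the engine figures out what to add, update, or delete" concrete, here is a minimal plain-Python sketch of target-state reconciliation. It is not CocoIndex's API or its internals. The `Plan` class, the `reconcile` function, and the row keys are all hypothetical, and the content hashes are stand-in strings.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Row keys to write or delete so the index matches the declared state."""
    add: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)


def reconcile(desired: dict, current: dict) -> Plan:
    """Diff two {row_key: content_hash} maps into a minimal sync plan."""
    plan = Plan()
    for key, digest in desired.items():
        if key not in current:
            plan.add.append(key)        # row is declared but not in the index yet
        elif current[key] != digest:
            plan.update.append(key)     # row exists but its content changed
    # Rows in the index that the source no longer implies get removed.
    plan.delete = [key for key in current if key not in desired]
    return plan


# Hypothetical state: what the index holds now vs. what the source implies.
current = {"a.pdf#0": "h1", "a.pdf#1": "h2", "old.pdf#0": "h3"}
desired = {"a.pdf#0": "h1", "a.pdf#1": "h2-edited", "new.pdf#0": "h4"}

plan = reconcile(desired, current)
print(plan)
```

The unchanged row `a.pdf#0` appears in none of the three lists, which is the whole point: work scales with the size of the diff, not the size of the corpus.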

The project hit 1.0 in mid-May and is already at v1.0.14 as of this week with over ten thousand GitHub stars. The release cadence is healthy, the documentation is thorough, and the architecture is built around a genuinely pragmatic observation: incremental processing is not an optimization. It is the only way a vector pipeline survives contact with a production data set.

dbt became the standard for data transformations because it gave analysts a declarative language for what the data should look like and handled the rest automatically. You write SELECT statements, and dbt figures out which tables to build, which to refresh, and how to order the dependencies. CocoIndex does the same thing, but for embedding pipelines. You write a Python function that declares the target state: chunk a document to produce rows with a text field and an embedding vector, then mount them to a Postgres table or a Qdrant collection. The engine handles the incremental sync automatically.

The declaration model is where the similarity to dbt is most visible. You annotate your processing function with @coco.fn(memo=True), which tells the engine to cache results keyed by the hash of both the input and the function code. When you re-run, unchanged inputs with unchanged logic skip entirely. When the logic changes, say you swapped a chunking strategy or updated your embedding model, only the affected outputs are recomputed. The dependency tracking is automatic and happens at the function-call level, not the file level.
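
The idea of keying a cache on both the input and the function code can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how CocoIndex implements `@coco.fn(memo=True)`. The `memo` decorator and `code_fingerprint` helper are hypothetical, and hashing bytecode plus constants is only a rough stand-in for "the function code" (it would miss changes inside helpers the function calls).

```python
import functools
import hashlib


def code_fingerprint(fn) -> str:
    """Hash a function's bytecode and constants as a stand-in for 'the code'."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def memo(cache: dict):
    """Cache results under (code fingerprint, input hash)."""
    def decorate(fn):
        fingerprint = code_fingerprint(fn)

        @functools.wraps(fn)
        def wrapper(text: str):
            key = (fingerprint, hashlib.sha256(text.encode()).hexdigest())
            if key not in cache:
                cache[key] = fn(text)   # miss: run the function and remember it
            return cache[key]
        return wrapper
    return decorate


cache = {}
calls = []  # records which version of the function actually ran


@memo(cache)
def chunk(text: str):
    calls.append("v1")
    return text.split(". ")


chunk("One. Two")  # computed
chunk("One. Two")  # skipped: same input, same code


@memo(cache)
def chunk(text: str):  # the logic changed: a different separator
    calls.append("v2")
    return text.split("\n")


chunk("One. Two")  # recomputed: the code fingerprint differs
print(calls)
```

Each version of the function runs exactly once for this input. The second call with the old code never reaches the function body.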

The practical difference shows up the first time you add a document to a corpus that has been running for three months. With a full-reindex pipeline, you wait. With CocoIndex, the new document is processed, embedded, and written to the target store within a few seconds, and the GPU that was occupied by nightly reindexes goes back to serving inference.

The connector ecosystem is one of the strongest arguments for the framework. Eighteen connectors at the time of writing cover local files, S3, Google Drive, OCI Object Storage, Postgres, SQLite, Qdrant, LanceDB, Neo4j, SurrealDB, Turbopuffer, ZVec, FalkorDB, Doris, Iggy, Kafka, Redis via Valkey, and Amazon S3 as both source and target. The breadth means you can wire a pipeline from a Google Drive folder of PDFs into a Postgres-backed vector index in roughly forty lines of code, and that pipeline will stay in sync automatically as long as you run it.

The @coco.fn memoization is worth pausing on, because it is the mechanism that makes everything else work. CocoIndex stores not just the output of each function call but the hash of the input and the hash of the function code itself. When both match a previous run, the cached output is returned with zero computation. When a function's code changes, only that function is re-executed on the inputs that flow through it; functions whose code hash did not change keep serving cached results. This is the difference between an incremental tool that re-indexes changed files and one that re-embeds only the new chunks inside a changed file while reusing the embeddings for the unchanged ones.
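
Chunk-level reuse inside a changed file is easy to see in a toy example. The sketch below is plain Python, not CocoIndex code. `fake_embed` is a hypothetical placeholder for a real embedding model, and splitting on blank lines is an assumption chosen so that an edit to one paragraph leaves the other chunks byte-identical.

```python
import hashlib

embed_calls = []      # records which chunks actually reached the "model"
embedding_cache = {}  # chunk hash -> vector


def fake_embed(chunk: str) -> list:
    """Placeholder for a real embedding model (hypothetical)."""
    embed_calls.append(chunk)
    return [float(len(chunk))]


def embed_document(text: str) -> list:
    """Split on blank lines and embed only chunks not seen before."""
    vectors = []
    for chunk in text.split("\n\n"):
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in embedding_cache:
            embedding_cache[key] = fake_embed(chunk)
        vectors.append(embedding_cache[key])
    return vectors


v1 = "Intro paragraph.\n\nMethods paragraph.\n\nResults paragraph."
v2 = "Intro paragraph.\n\nMethods paragraph, revised.\n\nResults paragraph."

embed_document(v1)  # first run: all three chunks are embedded
embed_document(v2)  # edited file: only the revised chunk is embedded
print(embed_calls)
```

Across both runs the placeholder model is called four times, not six. With fixed-offset chunking an edit near the top of a file would shift every later boundary and defeat the cache, so chunk stability matters as much as the cache itself.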

The MCP server story is also worth mentioning. CocoIndex ships an optional MCP server that exposes the indexed corpus to any MCP-aware agent, which means your Claude Code instance or your Cursor agent can query the vector index as a native tool without any middleware. The flagship example is CocoIndex-code, an MCP server that builds an AST-aware semantic code index of your repository and exposes it to coding agents. It produces sub-second freshness on incremental updates, roughly seventy percent fewer tokens per turn because the agent stops flooding the context window with full-file reads, and cache hit rates in the eighty-to-ninety percent range on re-index. If you have an agent that needs to search a codebase that changes daily, that is the difference between a tool that helps and one that gets ignored.

The Rust core matters for a practical reason that most Python-only frameworks do not address. The state tracking and change detection that underlie incremental processing are metadata-sensitive operations that benefit from a compiled runtime. Rust handles the hash tracking, the dependency graph, and the connector I/O without the overhead of the GIL, which matters when your pipeline is managing hundreds of thousands of chunked documents across multiple source and target connectors simultaneously. The Python DSL layer is where you write your transformations, and the Rust engine is where the expensive work happens.

The commitment to open source is worth noting because the category has already seen tools go from open-core to monetized-API in a single funding round. CocoIndex is Apache 2.0, no dual license, no enterprise-only features. The business model appears to be the managed platform, cocoindex.io, which offers a hosted version with a GUI dashboard and team management, but the engine itself is fully open and self-hostable. That is the right model for the category, and the dbt analogy holds here as well. dbt Labs built an open-source standard and monetized the platform around it, while the dbt-core engine remained Apache 2.0. CocoIndex is following the same playbook, and for teams that want to self-host their embedding pipeline behind a firewall, the open-source engine is not a trial version missing features.

The natural audience for this tool is any team that has a nightly reindex job and has accepted the cost because nobody offered a better option. If your pipeline runs in under five minutes for the full corpus, CocoIndex is over-engineering. You do not need incremental processing for a thousand documents that re-index in ninety seconds. The threshold where incremental processing becomes the difference between a pipeline that works and a pipeline that gets disabled is around the point where the full reindex exceeds the team’s tolerance for maintenance overhead. For most teams, that is somewhere between ten thousand documents and a hundred thousand, depending on embedding model speed and GPU availability.
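
A back-of-the-envelope estimate helps locate that threshold for your own pipeline. The function below is a rough sketch, and every number in it is an illustrative assumption rather than a measurement: the chunks-per-document figure, the embedding throughput, and the size of the nightly delta are all hypothetical.

```python
def reindex_minutes(docs: int, chunks_per_doc: float, chunks_per_second: float) -> float:
    """Estimated wall-clock minutes to chunk and embed `docs` documents."""
    return docs * chunks_per_doc / chunks_per_second / 60


# Illustrative assumptions only; substitute your own measurements.
CHUNKS_PER_DOC = 20
CHUNKS_PER_SECOND = 200

full = reindex_minutes(100_000, CHUNKS_PER_DOC, CHUNKS_PER_SECOND)  # whole corpus
delta = reindex_minutes(25, CHUNKS_PER_DOC, CHUNKS_PER_SECOND)      # tonight's changes

print(f"full reindex: {full:.0f} min, incremental run: {delta:.2f} min")
```

If the first number is already comfortably small for your corpus, the full reindex is fine. If it is measured in hours and the second is measured in seconds, you are past the threshold.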

There is one limitation worth naming. The memoization cache is a SQLite database by default, and for single-machine pipelines it works well. If you need the cache to be shared across machines or survive a complete infrastructure rebuild, you need to configure the database path explicitly and back it up alongside your source data. The documentation covers this, but the default configuration is easier to lose than it should be for the scale of pipeline that CocoIndex is designed to serve.
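
Why the path matters is easiest to show with a generic SQLite-backed cache. This sketch uses only the Python standard library and makes no claim about CocoIndex's actual schema or configuration keys. The table layout, the `open_cache` and `cached` helpers, and the file name are all hypothetical.

```python
import hashlib
import os
import sqlite3
import tempfile


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a memo cache at an explicit, backup-friendly path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memo (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def cached(conn, text: str, compute):
    """Return the stored value for text, computing and saving it on a miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    row = conn.execute("SELECT value FROM memo WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    value = compute(text)
    conn.execute("INSERT INTO memo (key, value) VALUES (?, ?)", (key, value))
    conn.commit()
    return value


# Hypothetical location; in practice this would sit on a volume you back up.
cache_path = os.path.join(tempfile.mkdtemp(), "memo_cache.sqlite")

conn = open_cache(cache_path)
first = cached(conn, "hello", str.upper)
conn.close()

# A fresh connection stands in for a new process or a restored backup.
conn = open_cache(cache_path)
misses = []
second = cached(conn, "hello", lambda t: misses.append(t) or t.upper())
conn.close()

print(first, second, misses)
```

The second lookup never calls its compute function because the file survived between connections. Lose the file and every entry becomes a miss, which for a large corpus means paying for the full reindex again.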

The other consideration is that the framework is still young. v1.0 launched six weeks ago, and the release notes show healthy activity with genuine feature additions, not just bug fixes, but the ecosystem of community-contributed connectors, transformation libraries, and operational tooling has not had time to accumulate. If you need a connector for an obscure internal data store, you are writing it yourself. The connector API is well-documented, but it is not zero effort.

For the team whose nightly reindex takes longer than a development cycle, CocoIndex is the tool that changes the calculation. It is not a faster embedding model. It is not a more efficient chunking strategy. It is a structural change to how the pipeline works. The delta cost of adding a document goes from the full index cost to the cost of processing one document. The delta cost of changing a model goes from the full reindex to only the documents whose embeddings are stale. The nightly six-hour job becomes a nightly job that finishes before you finish your coffee. That is the value, and it is worth the setup investment.

If this was useful, forward it to one engineer who needs less noise in their feed.
