Production Vector Store Infrastructure That Scales

Production vector search infrastructure: engine selection, embedding pipelines, index lifecycle, scaling, and operational reliability for enterprise AI retrieval.

Your Vector Store Works in the Demo. Production Is a Different Problem.

The gap between a vector search proof-of-concept and a production system that handles real query volume without degrading is almost entirely an infrastructure problem. The demo loads 50K vectors into Pinecone, runs a cosine similarity query, and returns results in 40ms. Then you go to production: 200M vectors, 500 queries per second sustained, 15 metadata filter dimensions, documents updating hourly, and three teams sharing the same cluster with strict tenant isolation. Now you're dealing with HNSW compaction spikes blowing up your P99 to 800ms, recall silently degrading after a week of incremental updates, and an embedding pipeline that can't keep up with your document change rate. This is where we work.

We build and operate the retrieval infrastructure layer that sits underneath your RAG pipeline, your agentic workflows, or your semantic search product. That means the vector engine itself, the embedding pipeline feeding it, the index lifecycle management keeping it healthy, and the observability stack catching quality degradation before users notice. We're vendor-neutral on the engine. We've deployed Qdrant, Milvus, Weaviate, pgvector, and Elasticsearch kNN in production, and the right choice depends on your vector count, query patterns, multi-tenancy requirements, and operational capacity.

Engine Selection Is a Benchmarking Exercise, Not a Brand Decision

The vector database market hit $2.55 billion in 2025 and is growing at 27.5% annually. Every vendor claims best-in-class performance. The benchmarks tell a more nuanced story. pgvector 0.8 with iterative scans delivers 471 QPS at 99% recall on 50M vectors on Aurora PostgreSQL. Qdrant's scalar quantization serves billion-vector indices off NVMe SSDs at sub-20ms P95, and the project's 27K+ GitHub stars back an aggressive release cadence. Elasticsearch 9.2's DiskBBQ maintains a 100MB memory footprint regardless of index size, which fundamentally changes the cost model for large-scale deployments. Milvus 2.5 ships native hybrid search (full-text plus vector) in a single engine with GPU-accelerated CAGRA indexing.

We don't pick a database from a feature matrix. We load your actual vectors, run your actual queries with your actual metadata filters, and measure recall at operationally relevant k values alongside P50/P95/P99 latency under concurrent load. The results regularly contradict vendor marketing. Pinecone's serverless tier looks cost-effective until sustained high-QPS workloads push read unit costs past self-hosting breakeven. pgvector looks limiting until you realize iterative scans solved the filtered search problem that made dedicated engines necessary in the first place. Weaviate's multi-tenancy handles 50K active shards per node and 1M concurrent tenants on roughly 20 nodes, but operational complexity at that scale demands specific expertise.
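As a rough illustration of the two measurements named above, the sketch below computes recall at k against exact ground truth and nearest-rank latency percentiles. The function names (`recall_at_k`, `percentile`, `summarize`) and the input shapes are assumptions made for this example, not part of any engine's API, and a real harness would also drive concurrent load.

```python
import math
from typing import Dict, List, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the true top-k neighbours that the engine returned in its top-k."""
    truth = set(relevant[:k])
    if not truth:
        return 0.0
    return len(set(retrieved[:k]) & truth) / len(truth)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected in (0, 100]."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(
    latencies_ms: Sequence[float],
    results: List[Sequence[str]],
    ground_truth: List[Sequence[str]],
    k: int = 10,
) -> Dict[str, float]:
    """Mean recall@k plus P50/P95/P99 latency for one benchmark run."""
    recalls = [recall_at_k(r, g, k) for r, g in zip(results, ground_truth)]
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

The ground truth here is assumed to come from an exact (brute-force) search over the same vectors with the same metadata filters, so that recall reflects the approximate index rather than the embedding model.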

The Embedding Pipeline Is Where Production Retrieval Actually Breaks

Teams spend 60% of their vector search engineering effort on the pipeline, not the store. The pipeline handles document ingestion, chunking, embedding model inference, incremental re-indexing on document updates, and metadata propagation. Every one of these stages has failure modes that silently degrade retrieval quality.

Stale knowledge is the second most common production RAG failure. A document gets updated in Confluence but the vector index still serves the old embedding. The fix is change-data-capture triggers that detect document modifications and re-embed incrementally, not batch re-indexing on a nightly schedule. Ghost documents are equally insidious: your source document gets deleted but the vector remains in the index, returning results for content that no longer exists. Atomic transactions across a source-of-truth system and a vector store are nearly impossible in split architectures, so we build reconciliation layers that detect and purge orphaned vectors.
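A reconciliation pass of this kind can be sketched as a comparison between two ID-to-content-hash maps, one read from the source of truth and one stored as metadata alongside each vector. The names below (`ReconcilePlan`, `plan_reconciliation`) and the choice of a content hash as the change signal are assumptions for illustration only.

```python
from typing import Dict, List, NamedTuple


class ReconcilePlan(NamedTuple):
    orphaned: List[str]  # in the index but gone from the source: purge
    missing: List[str]   # in the source but absent from the index: embed
    stale: List[str]     # in both, but the content hash differs: re-embed


def plan_reconciliation(
    source_hashes: Dict[str, str],
    indexed_hashes: Dict[str, str],
) -> ReconcilePlan:
    """Diff the source of truth against the vector index by document ID and hash."""
    source_ids = set(source_hashes)
    indexed_ids = set(indexed_hashes)
    orphaned = sorted(indexed_ids - source_ids)
    missing = sorted(source_ids - indexed_ids)
    stale = sorted(
        doc_id
        for doc_id in source_ids & indexed_ids
        if source_hashes[doc_id] != indexed_hashes[doc_id]
    )
    return ReconcilePlan(orphaned, missing, stale)
```

The plan only lists work to do; deleting, embedding, and upserting are left to whatever client the chosen engine provides. Running the diff on a schedule catches ghost documents that a missed change event would otherwise leave in the index.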

Embedding model selection matters more than most teams realize, and switching later is expensive. Re-embedding 8M documents when you upgrade from text-embedding-ada-002 to text-embedding-3-large costs days of compute and requires dual-write infrastructure to avoid downtime. We evaluate embedding models against your domain-specific query set before you commit. Cohere embed-v4 leads multilingual retrieval at $0.01 per million tokens across 100+ languages. Nomic Embed v2 at 137M parameters runs on CPU with the best quality-to-size ratio in the market. OpenAI's text-embedding-3-large remains a strong all-rounder. The right choice depends on your language mix, latency requirements, and whether you can accept API dependency or need on-premises inference.

Index Lifecycle: The Operational Problem Nobody Warns You About

HNSW indices degrade. This is not a bug; it's an architectural reality. At 160M vectors, a full HNSW rebuild takes 3-6 hours on CPU-only hardware. Incremental updates degrade recall over time as the graph structure becomes suboptimal. Compaction events spike query latency. Every upsert, delete, and segment merge triggers sub-index rebuilds that burn CPU on maintenance instead of serving queries. The recall-latency tradeoff is inescapable: pushing from 0.8 to 0.95 recall increases HNSW latency by roughly 31%.

We engineer index lifecycle management that handles continuous ingestion without quality degradation. Blue-green index rotation lets us rebuild indices on separate infrastructure and swap atomically with zero downtime. Automated recall validation gates run a golden query set against the index after every major operation and compare the results with the expected answers. If recall drops below threshold, the swap doesn't happen. GPU-accelerated HNSW building in Qdrant and Elasticsearch cuts rebuild times by up to an order of magnitude, but the orchestration around when to rebuild, how to validate, and how to swap is custom engineering.
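A minimal sketch of a recall validation gate in front of a blue-green swap follows. It assumes the candidate index is reachable through a search callable and that the live index is selected by a named alias; `IndexAlias`, `golden_recall`, `promote_if_healthy`, and the 0.95 default threshold are all hypothetical names and values chosen for this example, not features of any specific engine.

```python
from typing import Callable, Dict, List, Sequence

# A search callable takes (query, k) and returns ranked document IDs.
SearchFn = Callable[[str, int], List[str]]


def golden_recall(search: SearchFn, golden: Dict[str, Sequence[str]], k: int) -> float:
    """Mean recall@k of `search` over a golden set of query -> expected IDs."""
    if not golden:
        raise ValueError("golden query set is empty")
    total = 0.0
    for query, expected in golden.items():
        truth = set(expected[:k])
        if not truth:
            raise ValueError(f"no expected IDs for query {query!r}")
        found = set(search(query, k)[:k])
        total += len(found & truth) / len(truth)
    return total / len(golden)


class IndexAlias:
    """Stand-in for whatever pointer decides which index serves traffic."""

    def __init__(self, live: str) -> None:
        self.live = live

    def swap(self, candidate: str) -> None:
        self.live = candidate


def promote_if_healthy(
    alias: IndexAlias,
    candidate_name: str,
    candidate_search: SearchFn,
    golden: Dict[str, Sequence[str]],
    k: int = 10,
    min_recall: float = 0.95,
) -> bool:
    """Swap to the candidate only if it clears the recall threshold."""
    if golden_recall(candidate_search, golden, k) < min_recall:
        return False  # the old index keeps serving
    alias.swap(candidate_name)
    return True
```

In a real deployment the swap would be an alias or routing update in the engine or load balancer; the point of the sketch is only that validation runs against the candidate before any traffic moves.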

Quantization extends this further. Qdrant now offers 1.5-bit, 2-bit, and asymmetric quantization options. Elasticsearch BBQ reduces heap by over 95% compared to float32. These save memory and improve throughput, but each quantization scheme has a different recall profile against different data distributions. We characterize the recall impact on your specific vectors before deploying quantization in production.

Multi-Tenancy and Isolation at Real Scale

Platform teams serving 200+ internal ML teams or SaaS products with thousands of customer tenants need vector infrastructure that guarantees isolation. Tenant A's queries must never return tenant B's data. Audit logs must trace every query to a tenant identity. Cold tenants shouldn't consume resources that hot tenants need.

Weaviate's one-shard-per-tenant model with tenant states (ACTIVE, INACTIVE, OFFLOADED to S3) is the most mature implementation for high-tenant-count deployments. Milvus supports isolation at database, collection, partition, or partition key level, giving you flexibility to match isolation granularity to your compliance requirements. But both require custom orchestration at scale: tenant provisioning, state transitions, quota management, and cross-tenant leakage detection are not handled by the database itself. We build the operational layer around the vector store that makes multi-tenancy manageable for regulated deployments requiring SOC 2 or ISO 27001 compliance.
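One simple form of cross-tenant leakage detection is a post-query check that every returned hit carries the requesting tenant's identity. The sketch below assumes each hit is a mapping with `id` and `tenant_id` keys, which is an assumption about the payload schema rather than a property of any particular engine.

```python
from typing import Iterable, List, Mapping


class TenantLeakError(RuntimeError):
    """Raised when a result set contains another tenant's data."""


def find_leaks(tenant_id: str, hits: Iterable[Mapping[str, str]]) -> List[str]:
    """Return IDs of hits whose tenant_id is missing or differs from the caller's."""
    return [hit["id"] for hit in hits if hit.get("tenant_id") != tenant_id]


def enforce_isolation(tenant_id: str, hits: List[Mapping[str, str]]) -> List[Mapping[str, str]]:
    """Pass results through unchanged, or fail closed if any hit leaks."""
    leaked = find_leaks(tenant_id, hits)
    if leaked:
        raise TenantLeakError(f"tenant {tenant_id} received foreign hits: {leaked}")
    return hits
```

A check like this is a backstop, not the isolation mechanism itself: the engine's tenant or partition scoping should already prevent foreign hits, and the check exists to surface the cases where it did not.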

Agentic Workflows Are Reshaping Infrastructure Requirements

The agentic AI wave changes what vector stores need to do. Static RAG retrieves documents against a fixed corpus. Agentic workflows require episodic memory (conversation history and intermediate reasoning), semantic search over a large document corpus, and user profile layers, often hitting all three in a single agent step. The sub-400ms latency requirement for agentic retrieval is tighter than batch RAG, and ACID transaction support is becoming essential for multi-step agents that need to update state without partial writes.

No single vector database handles all three memory layers well. Teams are assembling multi-store architectures: Redis for session state, Qdrant or Milvus for semantic search, and a graph database for relationship tracking. Oracle's Unified Memory Core (March 2026) attempts to converge vector, JSON, graph, relational, and spatial queries in one engine. We design the retrieval infrastructure layer for agentic systems: determining which stores handle which memory types, how queries route across stores, and how the system maintains consistency when an agent updates state across multiple backends in a single reasoning step.
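The routing decision described here can be sketched as a registry that maps each memory type to a backend. `MemoryType`, `MemoryRouter`, and the backend callables are hypothetical names for this example; the backends stand in for clients of whichever stores are actually chosen, and cross-store consistency is out of scope for the sketch.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List


class MemoryType(Enum):
    EPISODIC = "episodic"  # conversation history, intermediate reasoning
    SEMANTIC = "semantic"  # document corpus search
    PROFILE = "profile"    # user profile or preference data


# A backend takes a query string and returns matching items.
Backend = Callable[[str], List[str]]


class MemoryRouter:
    """Routes one agent query to the store registered for each memory type."""

    def __init__(self) -> None:
        self._backends: Dict[MemoryType, Backend] = {}

    def register(self, memory_type: MemoryType, backend: Backend) -> None:
        self._backends[memory_type] = backend

    def retrieve(
        self, query: str, memory_types: Iterable[MemoryType]
    ) -> Dict[MemoryType, List[str]]:
        """Query each requested memory type; raises KeyError if none is registered."""
        return {mt: self._backends[mt](query) for mt in memory_types}
```

This version queries stores one after another. Meeting a tight per-step latency budget would likely mean issuing the calls concurrently, which the same interface allows.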

What We Deliver

Every engagement starts with a benchmarking phase where we load your vectors, run your queries, and produce quantified engine recommendations. From there, we build the production infrastructure: the vector store cluster with capacity planning and scaling runbooks, the embedding pipeline with CDC-triggered incremental re-indexing and model version management, the index lifecycle automation with blue-green rotation and recall validation gates, the multi-tenancy orchestration layer if you need tenant isolation, and the observability stack with drift detection, recall regression alerts, and P95/P99 latency monitoring. We also build the migration tooling for teams moving between engines, using intermediate Parquet formats to preserve embeddings where dimensionality matches rather than forcing a full re-embed. We scope every engagement with explicit monthly infrastructure cost projections so you know what production will cost before you commit.


Frequently Asked Questions

How much does production vector search infrastructure cost to run?

Costs depend on vector count, query volume, and whether you use managed or self-hosted infrastructure. Pinecone starts at $50/month minimum (Standard) with $8.25 per million read units and $0.33/GB/month storage. At 60-100M queries per month, self-hosting becomes 50-75% cheaper. Embedding costs run $0.01-0.02 per million tokens for API-based models (Cohere embed-v4, OpenAI text-embedding-3-small) or GPU instance costs for on-premises inference. The hidden costs are operational: HNSW index rebuilds at 160M vectors take 3-6 hours of compute, embedding model switches require re-embedding the entire corpus, and index lifecycle management (compaction, recall validation, blue-green rotation) demands dedicated engineering capacity. We scope every engagement with monthly run-rate projections covering storage, compute, embedding inference, and operational overhead.
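The breakeven arithmetic can be sketched from the list prices quoted above ($50 monthly minimum, $8.25 per million read units, $0.33/GB/month). Two inputs are assumptions you would have to measure or estimate yourself: how many read units an average query consumes, and what a self-hosted cluster costs per month. Function and parameter names are illustrative.

```python
def managed_monthly_cost(
    queries: float,
    read_units_per_query: float,  # assumption: measure this on your workload
    storage_gb: float,
    price_per_million_ru: float = 8.25,
    price_per_gb: float = 0.33,
    minimum: float = 50.0,
) -> float:
    """Monthly managed-service cost in dollars, floored at the plan minimum."""
    read_cost = queries * read_units_per_query / 1_000_000 * price_per_million_ru
    return max(minimum, read_cost + storage_gb * price_per_gb)


def breakeven_queries(
    self_hosted_monthly: float,   # assumption: your own infrastructure estimate
    read_units_per_query: float,
    storage_gb: float,
    price_per_million_ru: float = 8.25,
    price_per_gb: float = 0.33,
) -> float:
    """Monthly query count at which managed usage cost equals the self-hosted cost."""
    remaining = self_hosted_monthly - storage_gb * price_per_gb
    if remaining <= 0:
        return 0.0  # storage alone already exceeds the self-hosted budget
    return remaining / (read_units_per_query * price_per_million_ru / 1_000_000)
```

For example, under the assumption of one read unit per query and 100 GB of storage, `managed_monthly_cost(100_000_000, 1, 100)` comes to roughly $858. The sketch ignores write units, embedding inference, and engineering time, all of which belong in a full run-rate projection.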

Should we use pgvector, Qdrant, Milvus, Weaviate, or Elasticsearch for vector search?

We benchmark your actual vectors and queries against candidates before recommending. pgvector 0.8 with iterative scans delivers 471 QPS at 99% recall on 50M vectors and costs nothing extra if you already run PostgreSQL. Qdrant's scalar quantization serves billion-vector indices off NVMe SSDs at sub-20ms P95 with GPU-accelerated HNSW building for faster index construction. Elasticsearch 9.2 DiskBBQ maintains 100MB memory regardless of index size, changing the economics for very large deployments. Milvus 2.5 ships native hybrid search with GPU CAGRA indexing. Weaviate handles 50K active shards per node for high-tenant-count SaaS workloads. The right choice depends on your vector count, metadata filter complexity, multi-tenancy needs, and whether your team can operate Kubernetes clusters or needs a managed service.

Why does our vector search quality degrade over time in production?

Three common causes. First, embedding drift: your data distribution shifts but the index was built on the old distribution. Drift-Adapter techniques recover 95-99% of original performance without full rebuilds. Second, stale knowledge: documents update in the source system but the vector index still serves old embeddings because your re-indexing runs on a nightly batch instead of CDC triggers. Third, HNSW graph degradation from incremental updates. The graph structure becomes suboptimal as vectors are added and deleted over time, and recall degrades without any error signal. The fix requires automated recall validation against a golden query set, incremental re-indexing triggered by document change events, and periodic blue-green index rotation to restore graph quality.

How do we migrate between vector databases without re-embedding everything?

No standard vector data format exists, and most vector databases don't export data in a way that preserves embeddings portably. Mainstream ETL tools like Airbyte and SeaTunnel don't handle vector migrations. If you keep the same embedding model and the target engine supports its dimensionality, you can export vectors to an intermediate Parquet or HDF5 format and reload them into the new engine without re-embedding. If you're also changing embedding models, re-embedding from source is unavoidable. We build migration tooling with dual-write capability so your production system continues serving from the old store while the new store catches up. A migration from Pinecone to a self-hosted engine typically takes 2-4 weeks depending on vector count and metadata complexity.

How do we handle multi-tenancy and data isolation in vector search?

Weaviate's one-shard-per-tenant model is the most mature for high-tenant-count deployments: 50K active shards per node, 1M concurrent tenants on roughly 20 nodes, with tenant states (ACTIVE, INACTIVE, OFFLOADED to S3) for cost management. Milvus supports isolation at database, collection, partition, or partition key level. Both require custom orchestration at scale: tenant provisioning, state transitions, quota enforcement, and audit logging are not handled by the database itself. For SOC 2 or ISO 27001 compliance, you also need query-level access tracing and cross-tenant leakage detection. We build the operational layer around the vector store that manages tenant lifecycle and provides the audit trail regulated deployments require.

What embedding model should we use for production retrieval?

Evaluate candidates against your domain-specific queries first, rather than relying on MTEB leaderboard rankings. Cohere embed-v4 leads multilingual retrieval at $0.01 per million tokens with 1,024 dimensions across 100+ languages. OpenAI text-embedding-3-large is a strong general-purpose option. Nomic Embed v2 at 137M parameters delivers the best quality-to-size ratio and runs on CPU, eliminating GPU inference costs. For multimodal retrieval, Qwen3-VL-2B handles text, images, and documents in a single model. Production consensus is 768-1,024 dimensions for RAG workloads. The critical consideration is switching cost: changing models later means re-embedding your entire corpus, which at 8M documents costs days of compute and requires dual-write infrastructure. We benchmark candidates against your query patterns before you commit.

How do we handle HNSW index rebuilds without downtime?

HNSW index rebuilds at 160M vectors take 3-6 hours on CPU-only hardware. GPU-accelerated HNSW building in Qdrant and Elasticsearch 9.3 (via NVIDIA cuVS) cuts this by up to an order of magnitude. But the rebuild time is only half the problem. The real challenge is rebuilding without taking the production index offline. We implement blue-green index rotation: a fresh index builds on separate infrastructure while the existing index continues serving queries. Once the build completes, automated recall validation runs against a golden query set. If recall meets the threshold, traffic switches atomically. If not, the old index keeps serving and we investigate. This handles the compaction latency spike problem too, since the new index has an optimal graph structure without the fragmentation from incremental updates.

What vector infrastructure does an agentic AI system need?

Agentic workflows require multiple memory layers beyond static document retrieval: episodic memory for conversation history and intermediate reasoning, semantic search over a document corpus, and user profile or preference stores. The sub-400ms latency requirement for agentic retrieval is tighter than batch RAG, and ACID transaction support matters for multi-step agents updating state without partial writes. No single vector database handles all layers well. Production implementations use multi-store architectures: Redis or DynamoDB for session state, Qdrant or Milvus for semantic search, and a graph database for relationship tracking. Oracle's Unified Memory Core (March 2026) converges vector, graph, and relational queries in one engine. We design the retrieval infrastructure layer for agentic systems, handling query routing across stores and consistency management when agents update multiple backends in a single reasoning step.

Build Your AI with Confidence.

Partner with a team that has deep experience in building the next generation of enterprise AI. Let us help you design, build, and deploy an AI strategy you can trust.

Veriprajna Deep Tech Consultancy specializes in building safety-critical AI systems for healthcare, finance, and regulatory domains. Our architectures are validated against established protocols with comprehensive compliance documentation.