Production Vector Store Infrastructure That Scales
Production vector search infrastructure: engine selection, embedding pipelines, index lifecycle, scaling, and operational reliability for enterprise AI retrieval.
Frequently Asked Questions
How much does production vector search infrastructure cost to run?
Costs depend on vector count, query volume, and whether you use managed or self-hosted infrastructure. Pinecone's Standard plan has a $50/month minimum, with reads billed at $8.25 per million read units and storage at $0.33/GB/month. At 60-100M queries per month, self-hosting becomes 50-75% cheaper than the managed service. Embedding costs run $0.01-0.02 per million tokens for API-based models (Cohere embed-v4, OpenAI text-embedding-3-small); on-premises inference replaces that per-token fee with GPU instance costs. The hidden costs are operational: HNSW index rebuilds at 160M vectors take 3-6 hours of compute, embedding model switches require re-embedding the entire corpus, and index lifecycle management (compaction, recall validation, blue-green rotation) demands dedicated engineering capacity. We scope every engagement with monthly run-rate projections covering storage, compute, embedding inference, and operational overhead.
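As a rough illustration of the run-rate arithmetic, the sketch below turns the Pinecone Standard prices quoted above into a monthly estimate. The function name, the `read_units_per_query` parameter, and the treatment of the $50 minimum as a floor on the bill are assumptions made for illustration; read-unit consumption per query varies with index size and query shape and should be measured on your own workload.

```python
# Prices as quoted in the answer above (Pinecone Standard).
MONTHLY_MINIMUM_USD = 50.00
READ_UNIT_PRICE_PER_MILLION_USD = 8.25
STORAGE_PRICE_PER_GB_MONTH_USD = 0.33


def managed_monthly_cost(queries_per_month: int,
                         read_units_per_query: float,
                         storage_gb: float) -> float:
    """Estimate a monthly bill from reads and storage.

    Assumption: the plan minimum acts as a floor on the total bill.
    Write units, egress, and embedding inference are deliberately left out.
    """
    read_units = queries_per_month * read_units_per_query
    read_cost = read_units / 1_000_000 * READ_UNIT_PRICE_PER_MILLION_USD
    storage_cost = storage_gb * STORAGE_PRICE_PER_GB_MONTH_USD
    return max(MONTHLY_MINIMUM_USD, read_cost + storage_cost)


if __name__ == "__main__":
    # Hypothetical workload: 80M queries/month, 1 read unit each, 500 GB stored.
    estimate = managed_monthly_cost(80_000_000, 1.0, 500.0)
    print(f"Estimated managed cost: ${estimate:,.2f}/month")
```

With the hypothetical workload shown, reads come to about $660 and storage to about $165, so the estimate is roughly $825 per month. A small workload (for example 100K queries and 10 GB) falls below the minimum and is billed at $50.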
Should we use pgvector, Qdrant, Milvus, Weaviate, or Elasticsearch for vector search?
We benchmark your actual vectors and queries against candidate engines before recommending one. pgvector 0.8 with iterative scans delivers 471 QPS at 99% recall on 50M vectors and costs nothing extra if you already run PostgreSQL. Qdrant's scalar quantization serves billion-vector indices off NVMe SSDs at sub-20ms P95, and its GPU-accelerated HNSW building shortens index construction. Elasticsearch 9.2 DiskBBQ holds memory use at 100MB regardless of index size, changing the economics for very large deployments. Milvus 2.5 ships native hybrid search with GPU CAGRA indexing. Weaviate handles 50K active shards per node for high-tenant-count SaaS workloads. The right choice depends on your vector count, metadata filter complexity, multi-tenancy needs, and whether your team can operate Kubernetes clusters or needs a managed service.
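The recall figures in a benchmark like this come from comparing each engine's approximate results with exact nearest neighbors computed by brute force. A minimal sketch of that measurement follows, assuming Euclidean distance and a `search(query, k)` callable that wraps whichever engine is under test; both names are hypothetical and not part of any engine's API.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
SearchFn = Callable[[Vector, int], Sequence[int]]


def exact_top_k(query: Vector, corpus: Sequence[Vector], k: int) -> list[int]:
    """Brute-force ground truth: indices of the k closest corpus vectors."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(corpus)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


def recall_at_k(search: SearchFn,
                queries: Sequence[Vector],
                corpus: Sequence[Vector],
                k: int) -> float:
    """Fraction of true top-k neighbors that the engine under test returned."""
    hits = 0
    for query in queries:
        truth = set(exact_top_k(query, corpus, k))
        found = set(search(query, k))
        hits += len(truth & found)
    return hits / (k * len(queries))


if __name__ == "__main__":
    corpus = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
    queries = [(0.0, 0.0)]
    # Stand-in for an ANN engine that misses one true neighbor.
    lossy_search = lambda q, k: [0, 3]
    print(recall_at_k(lossy_search, queries, corpus, k=2))
```

Brute force is quadratic, so in practice the ground truth is computed once on a sample of queries and reused across engines. Throughput (QPS) is measured separately, at the index settings that reach the target recall.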
Why does our vector search quality degrade over time in production?
Three common causes. First, embedding drift: your data distribution shifts but the index was built on the old distribution. Drift-Adapter techniques recover 95-99% of original performance without full rebuilds. Second, stale knowledge: documents update in the source system but the vector index still serves old embeddings because your re-indexing runs on a nightly batch instead of CDC triggers. Third, HNSW graph degradation from incremental updates. The graph structure becomes suboptimal as vectors are added and deleted over time, and recall degrades without any error signal. The fix requires automated recall validation against a golden query set, incremental re-indexing triggered by document change events, and periodic blue-green index rotation to restore graph quality.
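One way to catch stale embeddings is to store a content hash next to each vector and compare it with the source system whenever a change event arrives or a sweep runs. The sketch below assumes documents are plain strings keyed by ID and that the index keeps the hash of the text each embedding was built from; `content_hash` and `plan_reindex` are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the document text an embedding was built from."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (ids to re-embed, ids to delete from the index).

    A document needs re-embedding when it is new or its text has changed.
    An indexed ID with no source document is a stale vector to remove.
    """
    to_embed = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in source_docs
    )
    return to_embed, to_delete


if __name__ == "__main__":
    source = {"a": "unchanged", "b": "edited text", "c": "brand new"}
    indexed = {
        "a": content_hash("unchanged"),
        "b": content_hash("original text"),
        "d": content_hash("deleted upstream"),
    }
    print(plan_reindex(source, indexed))
```

In a CDC-driven pipeline the same comparison runs per event rather than over the whole corpus; the full sweep is still useful as a periodic safety net for missed events.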
How do we migrate between vector databases without re-embedding everything?
No standard vector data format exists, and most vector databases don't export data in a way that preserves embeddings portably. Mainstream ETL tools like Airbyte and SeaTunnel don't handle vector migrations. If you are keeping the same embedding model, so dimensions and distance metric stay the same, you can export vectors to an intermediate Parquet or HDF5 format and reload them into the new engine without re-embedding. If you're also changing embedding models, re-embedding from source is unavoidable. We build migration tooling with dual-write capability so your production system continues serving from the old store while the new store catches up. A migration from Pinecone to a self-hosted store typically takes 2-4 weeks depending on vector count and metadata complexity.
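The dual-write pattern can be sketched as a thin wrapper around two store clients. Everything below is illustrative: `InMemoryStore` stands in for real engine clients (whose APIs differ), and `DualWriteStore` and `backfill` are hypothetical names. The key properties are that the old store stays the source of truth for reads, a failure in the new store never fails a production write, and the backfill skips IDs that live traffic has already written.

```python
from typing import Any

Row = tuple[list[float], dict[str, Any]]


class InMemoryStore:
    """Stand-in for a vector store client; real engines expose different APIs."""

    def __init__(self) -> None:
        self.rows: dict[str, Row] = {}

    def upsert(self, vec_id: str, vector: list[float], metadata: dict[str, Any]) -> None:
        self.rows[vec_id] = (list(vector), dict(metadata))


class DualWriteStore:
    """Writes go to both stores; reads keep coming from the old store."""

    def __init__(self, old: InMemoryStore, new: InMemoryStore) -> None:
        self.old = old
        self.new = new
        self.failed_ids: list[str] = []  # writes to replay against the new store

    def upsert(self, vec_id: str, vector: list[float], metadata: dict[str, Any]) -> None:
        self.old.upsert(vec_id, vector, metadata)  # old store is the source of truth
        try:
            self.new.upsert(vec_id, vector, metadata)
        except Exception:
            self.failed_ids.append(vec_id)  # record for replay, keep serving

    def get(self, vec_id: str) -> Row:
        return self.old.rows[vec_id]


def backfill(old: InMemoryStore, new: InMemoryStore) -> int:
    """Copy historical rows, skipping IDs the dual-write path already wrote."""
    copied = 0
    for vec_id, (vector, metadata) in old.rows.items():
        if vec_id not in new.rows:
            new.upsert(vec_id, vector, metadata)
            copied += 1
    return copied
```

Once the backfill completes and `failed_ids` has been replayed, the two stores can be compared row by row before reads are cut over.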
How do we handle multi-tenancy and data isolation in vector search?
Weaviate's one-shard-per-tenant model is the most mature for high-tenant-count deployments: 50K active shards per node, 1M concurrent tenants on roughly 20 nodes, with tenant states (ACTIVE, INACTIVE, OFFLOADED to S3) for cost management. Milvus supports isolation at database, collection, partition, or partition key level. Both require custom orchestration at scale: tenant provisioning, state transitions, quota enforcement, and audit logging are not handled by the database itself. For SOC 2 or ISO 27001 compliance, you also need query-level access tracing and cross-tenant leakage detection. We build the operational layer around the vector store that manages tenant lifecycle and provides the audit trail regulated deployments require.
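A small part of that operational layer, tenant state transitions with an audit trail, might look like the sketch below. The three state names come from the answer above; the transition policy (for example, offloading only from INACTIVE) is a hypothetical house rule rather than a constraint imposed by Weaviate, and `TenantRegistry` is an illustrative name that does not call any database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class TenantState(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    OFFLOADED = "OFFLOADED"


# Hypothetical policy: tenants must be deactivated before being offloaded.
ALLOWED_TRANSITIONS = {
    TenantState.ACTIVE: {TenantState.INACTIVE},
    TenantState.INACTIVE: {TenantState.ACTIVE, TenantState.OFFLOADED},
    TenantState.OFFLOADED: {TenantState.ACTIVE, TenantState.INACTIVE},
}


@dataclass
class TenantRegistry:
    states: dict[str, TenantState] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, tenant_id: str, before, after: TenantState, actor: str) -> None:
        self.audit_log.append({
            "tenant": tenant_id,
            "from": before.value if before else None,
            "to": after.value,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def provision(self, tenant_id: str, actor: str) -> None:
        if tenant_id in self.states:
            raise ValueError(f"tenant {tenant_id!r} already exists")
        self.states[tenant_id] = TenantState.ACTIVE
        self._record(tenant_id, None, TenantState.ACTIVE, actor)

    def transition(self, tenant_id: str, new_state: TenantState, actor: str) -> None:
        current = self.states[tenant_id]
        if new_state not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"{current.value} -> {new_state.value} is not allowed")
        # In a real deployment the vector store's tenant API would be called
        # here, and the registry updated only after that call succeeds.
        self.states[tenant_id] = new_state
        self._record(tenant_id, current, new_state, actor)
```

Every state change, including provisioning, leaves an entry in `audit_log`, which is the kind of record a SOC 2 or ISO 27001 review typically asks for.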
What embedding model should we use for production retrieval?
Default to evaluating against your domain-specific queries, not MTEB leaderboard rankings. Cohere embed-v4 leads multilingual retrieval at $0.01 per million tokens with 1,024 dimensions across 100+ languages. OpenAI text-embedding-3-large is a strong general-purpose option. Nomic Embed v2 at 137M parameters delivers the best quality-to-size ratio and runs on CPU, eliminating GPU inference costs. For multimodal retrieval, Qwen3-VL-2B handles text, images, and documents in a single model. Production consensus is 768-1,024 dimensions for RAG workloads. The critical consideration is switching cost: changing models later means re-embedding your entire corpus, which at 8M documents costs days of compute and requires dual-write infrastructure. We benchmark candidates against your query patterns before you commit.
How do we handle HNSW index rebuilds without downtime?
HNSW index rebuilds at 160M vectors take 3-6 hours with CPU-only hardware. GPU-accelerated HNSW building in Qdrant and Elasticsearch 9.3 (via NVIDIA cuVS) cuts this by up to an order of magnitude. But the rebuild time is only half the problem. The real challenge is rebuilding without taking the production index offline. We implement blue-green index rotation: a fresh index builds on separate infrastructure while the existing index continues serving queries. Once the build completes, automated recall validation runs against a golden query set. If recall meets threshold, traffic switches atomically. If not, the old index keeps serving and we investigate. This handles the compaction latency spike problem too, since the new index has an optimal graph structure without the fragmentation from incremental updates.
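The recall gate and traffic switch described above can be sketched as follows. This is a minimal illustration under stated assumptions: each index is represented by a `search(query, k)` callable, the golden set maps query strings to the IDs a healthy index should return, and `IndexAlias` is a hypothetical stand-in for whatever alias or routing mechanism the engine provides.

```python
from typing import Callable, Sequence

SearchFn = Callable[[str, int], Sequence[str]]
GoldenSet = Sequence[tuple[str, set[str]]]


def golden_recall(search: SearchFn, golden: GoldenSet, k: int) -> float:
    """Share of expected IDs that the index returns across the golden queries."""
    found = 0
    expected_total = 0
    for query, expected_ids in golden:
        returned = set(search(query, k))
        found += len(returned & expected_ids)
        expected_total += len(expected_ids)
    return found / expected_total if expected_total else 0.0


class IndexAlias:
    """Single indirection point that production queries go through."""

    def __init__(self, live: SearchFn) -> None:
        self.live = live

    def search(self, query: str, k: int) -> Sequence[str]:
        return self.live(query, k)


def rotate_if_healthy(alias: IndexAlias,
                      candidate: SearchFn,
                      golden: GoldenSet,
                      k: int,
                      threshold: float) -> tuple[bool, float]:
    """Switch traffic to the candidate only if it passes the recall gate."""
    recall = golden_recall(candidate, golden, k)
    if recall >= threshold:
        alias.live = candidate  # one reference swap; the old index is untouched
        return True, recall
    return False, recall  # old index keeps serving while the build is investigated
```

Keeping the old index intact after a successful switch also gives a fast rollback path if production metrics regress despite the golden set passing.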
What vector infrastructure does an agentic AI system need?
Agentic workflows require multiple memory layers beyond static document retrieval: episodic memory for conversation history and intermediate reasoning, semantic search over a document corpus, and user profile or preference stores. The sub-400ms latency requirement for agentic retrieval is tighter than what batch RAG needs, and ACID transaction support matters for multi-step agents that update state and cannot tolerate partial writes. No single vector database handles all layers well. Production implementations use multi-store architectures: Redis or DynamoDB for session state, Qdrant or Milvus for semantic search, and a graph database for relationship tracking. Oracle's Unified Memory Core (March 2026) converges vector, graph, and relational queries in one engine. We design the retrieval infrastructure layer for agentic systems, handling query routing across stores and consistency management when agents update multiple backends in a single reasoning step.
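To make the routing and consistency concern concrete, the sketch below routes writes to a store per memory layer and undoes already-applied writes if a later one fails. It is an illustration under assumptions, not a substitute for real transactions: `MemoryStore` and `MemoryRouter` are hypothetical in-memory stand-ins, the layer names are made up, and compensation of this kind is best-effort rather than ACID.

```python
from typing import Any

_MISSING = object()  # sentinel: key did not exist before the write


class MemoryStore:
    """In-memory stand-in for one backend (session, semantic, graph, ...)."""

    def __init__(self, name: str, fail_on_write: bool = False) -> None:
        self.name = name
        self.fail_on_write = fail_on_write  # lets a demo simulate an outage
        self.data: dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        if self.fail_on_write:
            raise RuntimeError(f"{self.name} store unavailable")
        self.data[key] = value


class MemoryRouter:
    """Routes each write to the store for its memory layer."""

    def __init__(self, stores: dict[str, MemoryStore]) -> None:
        self.stores = stores

    def write_step(self, writes: list[tuple[str, str, Any]]) -> None:
        """Apply all writes of one reasoning step, or restore prior values."""
        applied: list[tuple[MemoryStore, str, Any]] = []
        try:
            for layer, key, value in writes:
                store = self.stores[layer]
                previous = store.data.get(key, _MISSING)
                store.put(key, value)
                applied.append((store, key, previous))
        except Exception:
            # Compensate in reverse order so earlier state is restored last.
            for store, key, previous in reversed(applied):
                if previous is _MISSING:
                    store.data.pop(key, None)
                else:
                    store.data[key] = previous
            raise
```

With real backends the compensating step can itself fail, which is why production designs usually add an outbox or write-ahead log so an interrupted step can be retried or rolled back later.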
Veriprajna Deep Tech Consultancy specializes in building safety-critical AI systems for healthcare, finance, and regulatory domains. Our architectures are validated against established protocols with comprehensive compliance documentation.