TL;DR: Conventional vector-DB benchmarks show high QPS because they run on static data and premium hardware. In production the same queries burn far more CPU, memory, and network, inflating cloud spend. A cost-aware benchmark that records per-query resource usage under realistic load reveals the true economic picture.
Key Takeaways
- Lab-only QPS numbers hide the CPU-seconds and network-GB a query actually costs.
- Scaling dataset size and concurrency changes cache hit rates, I/O patterns, and recall, which in turn drives spend.
- A reproducible, production-ready framework that records cloud-cost metrics lets you pick a vector DB that is fast and affordable.
The Mirage of Lab-Only Numbers

What if the impressive graphs you see are hiding a costly surprise?
Public benchmark reports love tidy charts: “10 M vectors, 100 concurrent users, high QPS.” The numbers look good, but the setup is far from a live service. Vendors typically:
- Load the entire dataset into RAM before the run.
- Warm the index once, then freeze it for the duration of the test.
- Use dedicated, high-end instances that most SaaS teams cannot afford.
The result is a performance ceiling that never sees the spikes of a real workload. When a new batch of vectors arrives, the index must be updated, caches are evicted, and latency tails appear. Those tails force engineers to over-provision resources just to keep SLAs intact.
[Why static tests mislead](https://example.com/why-static-tests-mislead) explains how this static mindset masks real-world pain points.
Why the static approach collapses in production
- Cache invalidation. Adding a small fraction of new vectors per hour pushes older entries out of the LRU cache, causing page faults.
- Index rebuild latency. Rebalancing the index consumes CPU cycles that compete with query threads, raising p99 latency.
- Network churn. Distributed shards must exchange updated metadata, adding billable egress.
Each of these mechanisms adds hidden cost that a pure QPS metric cannot capture.
Why the Easy Fix - Scaling Dataset Size - Still Falls Short
A common reaction is “let’s test with 10 M instead of 1 M vectors.” Bigger data does expose more I/O, but it doesn’t solve the core problem. Two hidden dynamics surface:
- Cache churn. At 1 M vectors the working set may fit in RAM, yielding low latency. At 10 M the same RAM holds only a fraction, causing frequent page faults.
- Concurrency explosion. Going from 1 to 100 simultaneous users changes the request pattern from sequential to bursty. Latency percentiles (p95, p99) often double or triple, even if average QPS stays flat.
Continuous ingestion adds another twist. Each new vector forces the index to rebalance, which can:
- Reduce recall if the index isn’t fully merged.
- Spike CPU usage as background compaction runs alongside foreground queries.
Concrete illustration of cache pressure
```bash
# Simulate a 10 M vector store on a large-memory instance
docker run -d --name vectordb \
  -e VDB_DATASET_SIZE=10000000 \
  -e VDB_MEMORY_LIMIT=large \
  vectordb/image:latest
```
Run a warm-up query set, then monitor page-fault rates:
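```bash
# Look up the server PID inside the container (process name assumed to be "vectordb")
pid=$(docker exec vectordb pgrep -o vectordb)
# Fault counters live in /proc/<pid>/stat: field 10 is minflt, field 12 is majflt
docker exec vectordb awk '{print "minflt=" $10, "majflt=" $12}' "/proc/$pid/stat"
```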
When the fault count climbs sharply between samples, the instance is thrashing. Adding a second node can roughly halve the fault rate because the working set splits across two memory pools; the cost shows up as extra instance hours.
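The cache-churn effect can also be sketched without any infrastructure. The simulation below is a simplified model, not a measurement of any real vector DB: it assumes uniformly random access and a plain LRU cache, and the function name `lru_hit_rate` and its parameters are made up for illustration. It shows how the hit rate drops once the working set outgrows the cache.

```python
import random
from collections import OrderedDict


def lru_hit_rate(working_set, capacity, accesses=50_000, seed=0):
    """Hit rate of an LRU cache under uniformly random access to `working_set` keys."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        key = rng.randrange(working_set)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / accesses


if __name__ == "__main__":
    # Same cache size, growing working set (sizes are illustrative, not real vector counts)
    for size in (1_000, 10_000):
        print(size, round(lru_hit_rate(size, capacity=1_000), 3))
```

Real query traffic is usually skewed rather than uniform, so production hit rates tend to degrade more gently than this model suggests; the direction of the effect is the point.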
Step-by-step to expose concurrency effects
- Start with 1 concurrent client and record p50 latency.
- Ramp to 10 clients; note the increase in CPU-seconds per query.
- Jump to 100 clients; capture p99 latency and memory pressure.
- Finally, fire a burst of many clients for a short period; observe tail spikes.
These steps reveal that scaling data size alone does not guarantee realistic cost insight.
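The ramp above can be sketched as a small driver. This is a minimal illustration under stated assumptions: `fake_query` is a stand-in for a real vector search call, `run_level` and `timed_call` are hypothetical helper names, and thread-pool workers only approximate independent clients.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(query_fn):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    query_fn()
    return time.perf_counter() - start


def run_level(query_fn, clients, queries_per_client=20):
    """Issue queries from `clients` parallel workers and summarize latency."""
    total = clients * queries_per_client
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(lambda _: timed_call(query_fn), range(total)))
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 percentile cut points
    return {"clients": clients, "queries": total,
            "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def fake_query():
    # Assumption: replace with a call to your vector DB client
    time.sleep(0.001)


if __name__ == "__main__":
    for level in (1, 10, 100):
        print(run_level(fake_query, level))
```

Recording CPU-seconds and memory pressure per level would need a separate collector on the database host; this sketch covers only the client-side latency percentiles.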
The Missing Variable: Cloud Cost per Query Under Real-World Load
Cost is a function of three resources that scale non-linearly under load:
- CPU/[GPU](/ai-ml-training) cycles. Vector similarity calculations dominate compute time.
- Memory pressure. Cache miss rates lift memory-GB-hour consumption.
- Network egress. Distributed queries pull vectors across nodes; each hop adds billable bandwidth.
Benchmarks that stop at “high QPS” ignore the steep, non-linear cost curve that appears once latency tails force you to add more nodes. The hidden spend shows up as:
- Higher instance counts to keep p99 latency under a target.
- Larger memory footprints to avoid swapping.
- Increased outbound traffic as queries span more shards.
The article [VAST Data cost-aware optimization](https://example.com/vast-cost-aware) details how focusing on these resource axes turns a fast-but-expensive system into a cheap, reliable one.
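As a rough sketch, the three resource axes above can be combined into a single cost figure for a run. The unit prices below are hypothetical placeholders, not any provider's actual rates, and the names `Prices` and `run_cost` are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices in USD; replace with your provider's rates."""
    cpu_second: float = 0.00002
    memory_gb_hour: float = 0.005
    network_gb: float = 0.09


def run_cost(cpu_seconds, memory_gb_hours, network_gb, prices=Prices()):
    """Total cost of one benchmark run across the three resource axes."""
    return (cpu_seconds * prices.cpu_second
            + memory_gb_hours * prices.memory_gb_hour
            + network_gb * prices.network_gb)


if __name__ == "__main__":
    # Example usage: 36,000 CPU-seconds, 64 GB-hours of memory, 12 GB of egress
    print(f"run cost: ${run_cost(36_000, 64, 12):.4f}")
```

A linear model like this only prices what was consumed; the non-linearity comes from how the inputs themselves grow as load forces extra nodes, memory, and cross-shard traffic.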
A Production-Ready Benchmarking Playbook

Below is a step-by-step framework you can run in any cloud. It blends the workload realism of VDBBench with explicit cost instrumentation.
Step 1 - Define realistic workloads
```yaml
workload:
  dataset_sizes: [1000000, 10000000]
  concurrency: [1, 10, 100, 250]
  ingestion_rate: moderate  # steady data-growth during the run
  query_mix:
    - top_k: 10
      filter_selectivity: low
    - top_k: 100
      filter_selectivity: moderate
```
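One way to turn the workload definition above into concrete runs is to expand it into one scenario per dataset size and concurrency level. The sketch below mirrors the definition as a Python dict to stay dependency-free; `expand_scenarios` is a hypothetical helper, and treating `query_mix` as shared across all scenarios is an assumption about how the mix is meant to be applied.

```python
from itertools import product

# Mirrors the Step 1 workload definition as a plain dict (no YAML parser needed)
WORKLOAD = {
    "dataset_sizes": [1_000_000, 10_000_000],
    "concurrency": [1, 10, 100, 250],
    "ingestion_rate": "moderate",
    "query_mix": [
        {"top_k": 10, "filter_selectivity": "low"},
        {"top_k": 100, "filter_selectivity": "moderate"},
    ],
}


def expand_scenarios(workload):
    """Return one scenario per (dataset size, concurrency) pair."""
    # Settings that are not swept are carried unchanged into every scenario
    shared = {k: v for k, v in workload.items()
              if k not in ("dataset_sizes", "concurrency")}
    return [
        {"dataset_size": size, "concurrency": clients, **shared}
        for size, clients in product(workload["dataset_sizes"], workload["concurrency"])
    ]


if __name__ == "__main__":
    for scenario in expand_scenarios(WORKLOAD):
        print(scenario["dataset_size"], scenario["concurrency"])
```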
Step 2 - Instrument cloud cost
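Capture the three resource metrics (CPU-seconds, memory-GB-hours, network-GB) and the resulting spend for each run using provider APIs. Billing data is reported per time period, not per query, so the per-query figure comes from dividing by the query count in Step 5.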
AWS example (Bash + CLI)
```bash
#!/usr/bin/env bash
# Tag resources for the benchmark run
# (activate "Benchmark" as a cost allocation tag in the Billing console,
#  otherwise the Cost Explorer filter below matches nothing)
aws ec2 create-tags --resources $(cat instance_ids.txt) \
  --tags Key=Benchmark,Value=VectorDB

# After the run, pull cost data (date -d is GNU date syntax)
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '-1 day' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{"Tags":{"Key":"Benchmark","Values":["VectorDB"]}}' \
  --metrics "UnblendedCost" "UsageQuantity" \
  > cost_report.json
```
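For GCP, enable Billing Export to BigQuery and query the exported billing table (named `gcp_billing_export_v1_<BILLING_ACCOUNT_ID>` for the standard export), filtering on the equivalent resource label.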
Step 3 - Run VDBBench-style scenarios
Step 4 - Record latency distribution
Collect p50, p95, p99 per scenario. Correlate spikes with the cost metrics pulled in Step 2.
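```python
import json
import pandas as pd
# Assumes latency.csv has 'timestamp', 'p99', 'cpu_seconds', 'network_gb' columns; cost data is daily, so join on date
latency = pd.read_csv('latency.csv', parse_dates=['timestamp'])
latency['date'] = latency['timestamp'].dt.strftime('%Y-%m-%d')
with open('cost_report.json') as f:  # Cost Explorer nests rows under ResultsByTime
    periods = json.load(f)['ResultsByTime']
costs = pd.DataFrame([{'date': p['TimePeriod']['Start'], 'cost_usd': float(p['Total']['UnblendedCost']['Amount'])} for p in periods])
merged = latency.merge(costs, on='date')
print(merged[['p99', 'cpu_seconds', 'network_gb', 'cost_usd']].describe())
```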
Step 5 - Normalize cost per successful query
```python
cost_per_query = total_cost / successful_queries
```
Set a recall target up front and discard any run that falls below it, so cheap-but-inaccurate configurations cannot win the comparison.
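A minimal sketch of this normalization with the recall gate built in. The 0.95 default threshold is an assumption chosen for illustration (the text only asks for "high" recall), and `cost_per_query` as a function is a hypothetical wrapper around the formula above.

```python
from typing import Optional


def cost_per_query(total_cost: float, successful_queries: int,
                   recall: float, min_recall: float = 0.95) -> Optional[float]:
    """Cost per successful query, or None if the run should be discarded."""
    # Assumption: 0.95 is a placeholder recall target; set your own
    if recall < min_recall or successful_queries <= 0:
        return None
    return total_cost / successful_queries


if __name__ == "__main__":
    # Illustrative run records, not real benchmark results
    runs = [
        {"name": "run-a", "total_cost": 12.0, "successful_queries": 1_200_000, "recall": 0.97},
        {"name": "run-b", "total_cost": 8.0, "successful_queries": 1_000_000, "recall": 0.90},
    ]
    for run in runs:
        value = cost_per_query(run["total_cost"], run["successful_queries"], run["recall"])
        print(run["name"], "discarded" if value is None else f"${value:.8f} per query")
```

Returning `None` for discarded runs, instead of a number, keeps a low-recall configuration from showing up as the cheapest row in a comparison.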
Step 6 - Compare across implementations
Run the same script against Pinecone, Milvus, Qdrant, Weaviate, and a native VAST Data index. Store results in a unified dashboard.
Comparison axes:
- CPU-seconds / query
- Memory-GB-hours / query
- Network-GB / query
- Cost per query @ high recall
The open-source repo [VDBBench](https://github.com/example/vdbbench) already provides the ingestion pipeline and query driver; you only need to add the cost-capture wrappers.
What Happens When You Benchmark for Cost, Not Just Speed
When cost becomes a first-class metric, several benefits emerge:
- Predictable OPEX. Teams right-size clusters, avoiding the over-provisioning that blind QPS chasing creates.
- Performance stability. Latency tails shrink because capacity matches real workload patterns instead of a best-case curve.
- Strategic vendor negotiation. Concrete cost-per-query data gives you leverage to demand better pricing tiers or SLA guarantees.
These outcomes echo lessons from real-world deployments where cost-focused observability turned chaotic spend into disciplined budgeting.
Frequently Asked Questions
Q: How do I add cost measurement to an existing vector DB benchmark?
A: Enable the cloud provider’s billing export, tag the benchmark’s compute resources, and aggregate CPU-seconds, memory-GB-hours, and network-GB per test run. Divide the total cost by the number of successful queries to get cost-per-query.
Q: Does VDBBench support [streaming](/data-engineering) data ingestion?
A: Yes. VDBBench includes a configurable ingestion pipeline that can simulate a steady data-growth rate while queries run concurrently.
Q: Which vector DB tends to have lower cost per query at larger scales?
A: It depends on the workload, so measure it on your own data. As one data point, published cost-aware benchmarks report that VAST Data’s native index achieves a large improvement over Milvus 2.6 while maintaining high recall, which suggests that native vector indexes can matter for cost efficiency.
Q: Can I use this benchmarking approach for on-prem deployments?
A: Absolutely. Replace cloud cost metrics with power-usage (kWh) and hardware depreciation calculations to derive an equivalent cost-per-query figure.
Q: What concurrency level should I test for a typical SaaS AI product?
A: Start with 1, 10, and 100 concurrent users; then add a burst test of several hundred users to surface tail-latency and cost spikes that matter in production.
Sources
Research and references cited in this article:
- How to Evaluate Vector Databases in 2026 - Actian Corporation
- Vector Database Benchmarks: A Definitive Guide to Tools, Metrics, and Top Performers
- [VDBBench 1.0: Real-World Benchmarking for Vector Databases - Milvus Blog](https://milvus.io/blog/vdbbench-1-0-benchmarking-with-your-real-world-production-workloads.md)
- Best Vector Databases in 2026: A Complete Comparison Guide
- zilliztech/VectorDBBench: Benchmark for vector databases. - GitHub
- What is a vector database & how does it work? | Google Cloud
- Scaling Vector Databases to Billions of Data Points: The Ultimate Guide
- The Hidden Cost of Vector Database Pricing Models
- Optimizing Vector Database Costs at Enterprise Scale - LinkedIn
- Best Vector Databases 2026: 6 Top Picks Compared & Tested
- How to Make a Vector Database Work for Your Enterprise
- The Architecture Behind Our 11× Vector Benchmark - VAST Data
