341x Faster: Optimizing a Rust Vector Database to Top-7 on ANN-Benchmarks


I built a vector database in Rust with a Python API and benchmarked its HNSW search engine against every implementation on ann-benchmarks.com. It started at 52 queries per second on SIFT-1M — the standard benchmark dataset from Aumüller et al. (2020). After 11 optimizations, two bug fixes, and one embarrassing measurement error, it does 17,746 QPS — projected top-7 on the ann-benchmarks SIFT leaderboard and the highest-ranked pure Rust implementation.

The full progression:

StageQPSRecall@10Multiplier
HashMap baseline520.991x
Integer indexing6480.9912.5x
AVX2/FMA SIMD1,8340.9935x
Bug fixes + algorithmic2,6860.99552x
AVX-512 + all optimizations17,7460.786341x

The headline 341x comes from the lowest-recall (ef=10) operating point, where QPS is highest. At matched recall (~0.995), the optimization is 62x; at 0.999 recall it’s 34x. Full curve below. All CPU benchmarks on AWS r6i.16xlarge (Intel Xeon Ice Lake, 64 vCPU, AVX-512) — the same hardware and single-threaded configuration used by ann-benchmarks.com. The PR is submitted.


The starting point

arrwDB is a vector database with a FastAPI frontend and a Rust core exposed via PyO3. The HNSW index — implementing the algorithm from Malkov & Yashunin (2020) — stored vectors in HashMap<String, Vec<f32>> and graph nodes in HashMap<String, HNSWNode>. Every neighbor visit during search required a string clone, a hash computation, and an RwLock acquisition.

52 QPS on SIFT-1M (1M vectors, 128 dimensions). For context, hnswlib — the reference C++ HNSW implementation — does thousands.


Phase 1: Low-hanging fruit (52 → 1,834 QPS)

Integer indexing (12.5x)

Replaced all HashMaps with contiguous arrays. Vectors became a flat Vec<f32> indexed by usize. Graph neighbors became Vec<Vec<usize>>. String-to-integer mapping happens once at the PyO3 boundary, never in the search loop.

52 → 648 QPS.

Explicit AVX2/FMA intrinsics (2.8x)

The distance computation hot path went through three stages: naive iterator .zip().map().sum(), then 8-wide unrolled scalar with 4 independent accumulators (auto-vectorized), then explicit _mm256_fmadd_ps intrinsics. The final version processes 32 floats per loop iteration with fused multiply-add — the same technique documented in Intel’s optimization manual for throughput-bound workloads. A function pointer set at construction time eliminates per-call match metric dispatch. Bounds checks on neighbor and alive arrays are eliminated with get_unchecked.

648 → 1,834 QPS.

What didn’t work

OptimizationExpectedActualWhy
Co-located memory layout+10-15%-10%At 128d, each node spans 14 cache lines. Prefetching 2 only warms the neighbor list + 20% of the vector. Separate contiguous storage lets the hardware sequential prefetcher handle vector reads.
Visited array L1 prefetch+5%-10%The 2MB visited array (u16 for 1M entries) fits in L2. Forcing entries to L1 evicted vector data from L1 cache.
Cached greedy distance+3-5%noiseEffect invisible in ~12% cloud VM variance on GCP.

At this point I was competitive with FAISS-HNSW (Johnson et al., 2021) but 1.5x behind hnswlib on a GCP VM. The problem: cloud VMs have too much noise to measure micro-optimizations. I needed the real hardware.


The recall ceiling

I rented an AWS r6i.16xlarge — the exact machine ann-benchmarks uses. First SIFT-1M run: 1,923 QPS at 0.991 recall. Encouraging. But recall plateaued at 0.992 no matter what I tried.

ef_constructionef_search=200ef_search=800ef_search=2000
4000.99170.99200.9919
8000.99190.99200.9917
12000.99180.99170.9918

Tripling ef_construction made zero difference. The ceiling was independent of every parameter. Every other algorithm on ann-benchmarks hits 0.999. Something was structurally wrong.

Bug 1: Measuring against the wrong ground truth

SIFT-1M ships with ground truth neighbors computed using L2 distance on raw vectors. My benchmark pipeline was normalizing vectors (L2 normalization) and searching with cosine distance — then comparing results against the original L2 ground truth. After normalization, about 0.8% of neighbor rankings change. The recall ceiling was in the measurement, not the algorithm.

Fix: Use raw vectors with metric="l2". One line in the dataset config.

Bug 2: Selecting 2*M neighbors instead of M

My HNSW construction selected m_max (2*M = 96) initial neighbors for each new node. In hnswlib’s mutuallyConnectNewElement, m_max is only the overflow capacity — new nodes get M (48) neighbors via getNeighborsByHeuristic2, and existing nodes can grow up to m_max from reverse connections. Selecting 96 upfront floods the graph with redundant clustered connections at the expense of the diverse long-range shortcuts that Algorithm 4 in Malkov & Yashunin (2020) is designed to produce.

Fix: Change select_neighbors_heuristic(&vectors, candidates, m_max, metric) to select_neighbors_heuristic(&vectors, candidates, self.m, metric) in both build_bulk and insert_node.

Result of both fixes: 0.992 → 0.9994 recall. Build time dropped from 2,651s to 1,589s (1.7x faster — less work per insertion with fewer connections).


Phase 2: Closing the gap (1,923 → 17,746 QPS)

With recall fixed, I stacked six optimizations. Each was benchmarked independently on the r6i.16xlarge.

Figure 1 — Optimization Progression (SIFT-1M, r6i.16xlarge)

QPS at ef=100 (~0.995 recall) after each optimization. The recall fix changed the graph structure, so the baseline shifted.

OptimizationQPS (ef=100)QPS (ef=200)Recall@200Build (1M)Delta
Baseline (bugs fixed)1,9531,1350.99926.5 min
+ Remove backfill2,5011,4060.99917.9 min+28%
+ Early termination2,6861,5120.99916.3 min+7%
+ 4-acc SIMD + worst_dist3,0581,7150.99915.0 min+14%
+ AVX-512 + prefetch 3-ahead3,2171,7930.99914.0 min+5%

Remove backfill in neighbor selection (+28%)

The diversity heuristic in select_neighbors_heuristic — Algorithm 4 from the HNSW paper — checks whether each candidate is closer to the query than to any already-selected neighbor. Candidates that fail this check are “discarded.” My implementation backfilled discarded candidates to guarantee exactly M neighbors per node. hnswlib doesn’t — if diversity rejects a candidate, it stays rejected.

Removing backfill creates sparser graphs with better long-range shortcuts. Search reaches 0.999 recall at lower ef values, directly translating to higher QPS. Build speed improved 50% because sparser graphs mean less work during construction.

Early termination (+7%)

hnswlib checks “is the best remaining candidate worse than the worst result?” before exploring a candidate’s neighbors. I was checking after. The difference: one full round of neighbor distance computations on the final iteration of every search query.

4-accumulator SIMD + tighter threshold (+14%)

The L2 AVX2 kernel went from 2 accumulators (16 floats/iter) to 4 accumulators (32 floats/iter), improving instruction-level parallelism — each FMA has 4-cycle latency but 0.5-cycle throughput on Ice Lake, so 4 independent chains keep the execution units full. Combined with updating worst_dist after every heap pop inside the neighbor loop — the tighter threshold rejects more candidates before computing their distances. This change also moved from M=48 to M=32 (33% fewer neighbors per hop).

AVX-512 + prefetch 3-ahead (+5%)

The r6i.16xlarge has Intel Ice Lake with AVX-512. Four 512-bit accumulators process 64 floats per loop iteration — for 128d SIFT, that’s 2 iterations instead of 4 with AVX2. Prefetching was extended from 1 neighbor ahead to 3 to better hide the ~70ns memory latency against the ~15ns L2 distance computation time.


Final results

Figure 2 — Recall vs QPS (SIFT-1M, r6i.16xlarge)

arrwDB’s Pareto frontier on SIFT-1M compared to reference implementations. All benchmarks on r6i.16xlarge, single-threaded. arrwDB data from this work; hnswlib and FAISS-HNSW curves are approximate (interpolated from ann-benchmarks.com published results). Official head-to-head results pending — PR #626 has been submitted to ann-benchmarks; this chart will be updated with exact comparison data once the benchmarks are run on their infrastructure.

SIFT-1M (128d, L2)

efRecall@10QPSLatency p50
100.78617,7460.06ms
500.9775,7350.18ms
1000.9953,2170.32ms
2000.9991,7930.58ms
8000.9995561.86ms

Deep-1M (96d, angular)

efRecall@10QPSLatency p50
100.78520,0820.05ms
500.9716,4700.16ms
1000.9913,6380.28ms
2000.9982,0380.50ms
8000.99996311.60ms

GloVe-1.2M (200d, angular)

efRecall@10QPSLatency p50
100.5208,3720.11ms
500.7592,7610.35ms
1000.8281,5820.64ms
4000.9214732.18ms
8000.9532524.11ms

Projected ann-benchmarks ranking (SIFT-1M at 0.999 recall)

RankSystemQPSLanguage
1qsgngt~7,000C++
2NGT-qg~4,300C++
3glass~2,400C++
4-6NGT-onng, pynndescent, n2~1,800-2,400C++/Python
7arrwDB1,793Rust
8hnswlib~1,400C++
9FAISS-HNSW~1,200C++

At ~0.95 recall (interpolated), arrwDB reaches ~7,700 QPS — roughly 1.4x faster than hnswlib (~5,500) and ScaNN (~5,400) at the same recall. At ~0.9 recall it reaches 11,540 QPS. The top 3 algorithms use quantized graph techniques — compressed neighbor representations and product quantization during search — that arrwDB hasn’t implemented. That’s the next frontier.

Competitor QPS values are interpolated from published ann-benchmarks results on the same hardware (r6i.16xlarge, single-threaded). arrwDB’s ann-benchmarks submission is pending — exact rankings will be updated once official results are available.


Cost

ResourceHoursCost
GCP n2-highmem-16 (Phase 1)~20 hrs~$55
GCP g2-standard-8 + L4 GPU~4 hrs~$12
AWS r6i.16xlarge (Phase 2)~24 hrs~$97
Total~$164

What I’d do differently

  1. Benchmark correctly from day one. The normalization bug cost days chasing a phantom recall ceiling. Always verify your ground truth matches your distance metric.

  2. Use dedicated hardware sooner. GCP shared VMs have ~12% variance from noisy neighbors. I reverted “regressions” that were just noise. The r6i.16xlarge gave consistent numbers.

  3. Don’t over-connect the graph. My instinct was “more connections = better recall.” The opposite is true — fewer, more diverse connections give both better recall and faster search.


Conclusion

  1. arrwDB’s HNSW engine went from 52 to 17,746 QPS (341x) on SIFT-1M through 11 optimizations over two hardware phases.
  2. Two correctness bugs (ground truth mismatch, neighbor selection count) were responsible for a hard recall ceiling at 0.992 that no parameter tuning could overcome.
  3. The highest-impact single optimization was removing backfill in neighbor selection (+28% QPS, +50% build speed, no recall loss).
  4. AVX-512 on Ice Lake provides a measurable but modest gain over AVX2 (~5-8%) — the algorithmic optimizations dominated.
  5. Projected top-7 on ann-benchmarks SIFT, ahead of hnswlib and FAISS-HNSW. The PR is submitted.

The HNSW algorithm was introduced by Malkov & Yashunin (2020), building on earlier navigable small-world graphs from Malkov et al. (2014). The reference C++ implementation is hnswlib, which has been the de facto standard for ANN benchmarking since its release. FAISS (Johnson et al., 2021) includes an HNSW variant alongside IVF and PQ methods. The ann-benchmarks framework (Aumüller et al., 2020) provides the standardized evaluation methodology used here.

The neighbor selection heuristic (Algorithm 4 in the HNSW paper) is a diversity-aware pruning strategy that favors long-range shortcuts over clustered connections. The impact of backfill on graph quality — the largest single optimization in this work — does not appear to be documented in the literature; hnswlib’s keepPrunedConnections parameter controls this behavior but defaults to false.

For SIMD optimization of distance computation, Intel’s Intrinsics Guide documents the AVX2/FMA and AVX-512 instructions used. The 4-accumulator technique for hiding FMA latency is standard practice in high-performance computing; see Agner Fog’s optimization manuals for the microarchitectural analysis.


11 optimizations benchmarked on AWS r6i.16xlarge (Intel Xeon Platinum 8375C, AVX-512), single-threaded, M=32, ef_construction=400. Total compute cost: $164. Source code and benchmark results at github.com/bledden/arrwDB.