How does a vector database differ from a traditional DB? How does it make similarity search fast?

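Traditional databases (Postgres, MySQL) look up rows by exact match or range using B-tree or hash indexes, which is very fast for queries like WHERE id = ?. Those indexes do not help with "find the vectors nearest to a query vector among 10 million vectors": without a dedicated index, the engine must compare the query against every stored vector, which costs O(n·d) per query for n vectors of dimension d.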
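
To make the O(n·d) cost concrete, here is a minimal sketch of exact brute-force search with NumPy. The function name `brute_force_knn` and the random data are illustrative assumptions, not part of any particular database.

```python
import numpy as np

def brute_force_knn(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k stored vectors closest to `query` (L2 distance)."""
    # One distance per stored vector: n rows times d dimensions of work.
    dists = np.linalg.norm(vectors - query, axis=1)
    # argpartition selects the k smallest in O(n); only those k get sorted.
    top_k = np.argpartition(dists, k - 1)[:k]
    return top_k[np.argsort(dists[top_k])]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
# Hypothetical query: a slightly perturbed copy of stored vector 42.
query = vectors[42] + 0.01 * rng.normal(size=64).astype(np.float32)

result = brute_force_knn(vectors, query, k=5)
print(result)  # the first index is 42, the vector the query was derived from
```

Every query touches all 10,000 rows here; at 10 million rows the same scan is what an ANN index is meant to avoid.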

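Vector DBs specialize in Approximate Nearest Neighbor (ANN) search: they accept approximate results (typically 95 to 99% recall, meaning that share of the true nearest neighbors is returned) in exchange for queries that can be hundreds of times faster than a full scan on large collections.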

Main ANN algorithms:

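1. HNSW (Hierarchical Navigable Small World): a multi-layer graph index in which each node is a vector and edges connect near neighbors. A query starts at the sparse top layer, greedily moves to the neighbor closest to the query, then uses that node as the entry point to the next layer down, repeating until layer 0, where a wider candidate search collects the final results. Query time grows roughly as O(log n). High recall (>95%), but memory-heavy because the graph is stored alongside the vectors. It is the default index in Qdrant and Weaviate, backs approximate kNN search in Elasticsearch, and is available in pgvector alongside IVFFlat.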

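2. IVF (Inverted File Index): k-means clustering assigns each vector to one of N clusters; a query is compared only against the vectors in the nprobe clusters whose centroids are closest to it. Fast and memory-light; recall is usually lower than HNSW at similar latency and rises as nprobe increases. A staple index family in Faiss (for example IndexIVFFlat).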

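3. PQ (Product Quantization): split each vector into subvectors and replace each subvector with the ID of its nearest centroid in a small per-subspace codebook. The compression ratio depends on the settings; for example, a 128-dimensional float32 vector (512 bytes) encoded as 16 one-byte codes is 32x smaller. Often paired with IVF (IVF-PQ) for billion-scale collections. Trades recall for memory.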

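4. ScaNN (Google) and DiskANN (Microsoft): designed for very large corpora. ScaNN combines partitioning with anisotropic quantization; DiskANN keeps a graph index on SSD so the full dataset does not have to fit in RAM.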
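
As an illustration of the IVF idea from item 2, the following is a toy sketch in NumPy, not how Faiss or any vector DB implements it. The helper names (`nearest_centroid`, `build_ivf`, `ivf_search`), the plain k-means loop, and the random data are all assumptions made for the example.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    # Squared L2 distance is ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term
    # is constant per row, so it can be dropped when taking the argmin.
    d = -2.0 * vectors @ centroids.T + (centroids ** 2).sum(axis=1)
    return d.argmin(axis=1)

def build_ivf(vectors, n_clusters, n_iters=10, seed=0):
    """Run a few k-means rounds; return centroids and one ID list per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest_centroid(vectors, centroids)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):  # keep the old centroid if a cluster went empty
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(vectors, centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the vectors stored in the nprobe clusters nearest to the query."""
    centroid_d = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(centroid_d)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(5000, 32))
query = rng.normal(size=32)
centroids, lists = build_ivf(vectors, n_clusters=50)

fast = ivf_search(query, vectors, centroids, lists, k=5, nprobe=4)   # scans ~8% of the data
full = ivf_search(query, vectors, centroids, lists, k=5, nprobe=50)  # scans everything
```

With nprobe equal to the number of clusters the search degenerates to an exact scan, so `full` is the ground truth; comparing `fast` against it for many queries gives a rough recall estimate for a given nprobe.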

Distance metrics: cosine, dot product, L2 (Euclidean). With L2-normalized embeddings, all three produce the same ranking, because for unit vectors cosine similarity equals the dot product and ‖a − b‖² = 2 − 2(a·b).
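
The ranking equivalence can be checked numerically. This is a small sketch on random data (the array names and sizes are arbitrary assumptions); it normalizes everything to unit length and compares the three orderings.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))
query = rng.normal(size=128)

# L2-normalize so every vector has unit length.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(docs - query, axis=1)

# Higher is better for dot/cosine, lower is better for L2.
rank_dot = np.argsort(-dot)
rank_cos = np.argsort(-cosine)
rank_l2 = np.argsort(l2)

top10_agree = (
    np.array_equal(rank_dot[:10], rank_cos[:10])
    and np.array_equal(rank_dot[:10], rank_l2[:10])
)
print("top-10 identical across metrics:", top10_agree)
```

If the embeddings are not normalized, dot product also rewards vector length, and the three rankings can diverge.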

Other features: metadata filtering (pre-filter before the vector search, or post-filter on its results), hybrid search (dense vectors plus sparse lexical scoring such as BM25), multi-tenancy (collections/namespaces), horizontal sharding for billion-scale.
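
The difference between pre-filtering and post-filtering is easiest to see when the filter is selective. The sketch below uses exact search as a stand-in for an ANN index; the function names and the boolean `allowed` mask (a hypothetical "belongs to this tenant" flag) are assumptions for illustration.

```python
import numpy as np

def post_filter_search(vectors, allowed, query, k):
    """Take the top-k by distance first, then drop hits that fail the filter."""
    d = np.linalg.norm(vectors - query, axis=1)
    top_k = np.argsort(d)[:k]
    return top_k[allowed[top_k]]  # may contain fewer than k results

def pre_filter_search(vectors, allowed, query, k):
    """Restrict to rows that pass the filter, then take the top-k among them."""
    ids = np.flatnonzero(allowed)
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 16))
query = rng.normal(size=16)
# Hypothetical metadata: exactly 1 row in 20 (5%) passes the filter.
allowed = np.arange(2000) % 20 == 0

post_hits = post_filter_search(vectors, allowed, query, k=10)
pre_hits = pre_filter_search(vectors, allowed, query, k=10)
print(len(post_hits), len(pre_hits))
```

Post-filtering is cheap but can return far fewer than k results when few rows match; pre-filtering always fills k (if enough rows match) but has to search inside the filtered subset, which is why many engines integrate the filter into the index traversal.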

Production picks: Pinecone (managed, easy to use), Qdrant (open source, fast, written in Rust), Weaviate (open source, feature-rich), Milvus (billion-scale), pgvector (if you already run Postgres; as a rule of thumb, under 10M vectors), Chroma (prototyping).
