Index
Blazing-fast fuzzy string matching — implemented entirely in Rust.
Built entirely by AI. Designed to beat RapidFuzz.
The Story
rustfuzz started as an experiment: can an AI agent, starting from scratch, build a fuzzy-matching library that outperforms RapidFuzz — one of the best-optimised C++ string-matching libraries in the Python ecosystem?
No human wrote the Rust. No human tuned the algorithm parameters. The AI drove every iteration, read every benchmark result, and decided what to rewrite next.
The answer the AI kept coming back to: Rust + PyO3 + tight Python-boundary design.
The Development Loop
Every feature and optimisation went through the same cycle:
flowchart LR
R["🔍 Research<br>Profiler output<br>& algorithm gaps"]
B["🦀 Build<br>Rust core<br>via PyO3"]
T["✅ Test<br>All tests must pass<br>before proceeding"]
BM["📊 Benchmark<br>vs RapidFuzz<br>& record results"]
RP["🔁 Repeat<br>Find the next<br>bottleneck"]
R --> B --> T --> BM --> RP --> R
style R fill:#6366f1,color:#fff,stroke:none
style B fill:#a855f7,color:#fff,stroke:none
style T fill:#ef4444,color:#fff,stroke:none
style BM fill:#22c55e,color:#fff,stroke:none
style RP fill:#f59e0b,color:#fff,stroke:none
Each step had a specific job:
- Research — where is the remaining Python overhead? What does the profiler show?
- Build — move that hot path into Rust. Eliminate copies, reduce allocations, avoid iterator protocol overhead.
- Test — the full test suite must pass before proceeding. No broken correctness, no skipped edge cases.
- Benchmark — run head-to-head comparisons vs RapidFuzz. Numbers don't lie.
- Repeat — the next bottleneck is always waiting.
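The test and benchmark steps above can be sketched as a small harness. This is a minimal illustration, not the project's actual tooling: it uses only the standard library, `difflib.SequenceMatcher` stands in for both the reference and the candidate scorer, and the names `baseline_ratio`, `candidate_ratio`, `PAIRS`, and `run_iteration` are hypothetical.

```python
import timeit
from difflib import SequenceMatcher


def baseline_ratio(a: str, b: str) -> float:
    """Stand-in for the reference implementation (an assumption, not RapidFuzz itself)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def candidate_ratio(a: str, b: str) -> float:
    """Stand-in for the implementation under test; adds an equality fast path."""
    if a == b:
        return 100.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Hypothetical fixture: a handful of string pairs used for both testing and timing.
PAIRS = [("hello world", "hello wrold"), ("kitten", "sitting"), ("same", "same")]


def passes_tests() -> bool:
    # Test step: the candidate must agree with the baseline on every pair.
    return all(
        abs(candidate_ratio(a, b) - baseline_ratio(a, b)) < 1e-9 for a, b in PAIRS
    )


def benchmark(fn, repeat: int = 5, number: int = 200) -> float:
    # Benchmark step: best-of-`repeat` wall time for scoring all pairs `number` times.
    timer = timeit.Timer(lambda: [fn(a, b) for a, b in PAIRS])
    return min(timer.repeat(repeat=repeat, number=number))


def run_iteration() -> dict:
    # Gate: never record benchmark numbers for a candidate that fails correctness.
    if not passes_tests():
        raise AssertionError("correctness regression: fix before benchmarking")
    base = benchmark(baseline_ratio)
    cand = benchmark(candidate_ratio)
    return {"baseline_s": base, "candidate_s": cand, "speedup": base / cand}


if __name__ == "__main__":
    print(run_iteration())
```

Taking the minimum over several repeats is a common way to reduce scheduling noise; the absolute timings depend on the machine, so only the ratio between the two runs is meaningful.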
Why This Matters
RapidFuzz is exceptional: its C++ core, SIMD intrinsics, and years of optimisation make it a formidable target. The goal of this project was never to dismiss it, but to prove that:
- AI can drive non-trivial systems programming — not just generate boilerplate.
- Rust + PyO3 can match C++ at the Python boundary — with the added safety guarantees Rust provides.
- Iterative AI-driven optimisation works — each benchmark loop produced measurable gains.
Features
| Feature | Description |
|---|---|
| ⚡ Blazing Fast | Core algorithms in Rust — no Python overhead, no GIL bottlenecks |
| 🧠 Smart Matching | ratio, partial_ratio, token sort/set, Levenshtein, Jaro-Winkler, and more |
| 🔒 Memory Safe | Rust's borrow checker — no segfaults, no buffer overflows |
| 🐍 Pythonic API | Typed Python interface — import rustfuzz.fuzz as fuzz and go |
| 📦 No Build Step | Pre-compiled wheels for Python 3.10–3.14 on Linux, macOS, and Windows |
| 🏔️ Big Data Ready | Excels in 1 Billion Row Challenge benchmarks, crushing high-throughput tasks |
| 🔍 3-Way Hybrid Search | BM25 + Fuzzy + Dense embeddings via RRF — 25ms at 1M docs, all in Rust |
| 📄 Document Objects | First-class Document(content, metadata) + LangChain compatibility |
| 🧩 Ecosystem Integrations | BM25, Hybrid Search, and LangChain Retrievers for Vector DBs |
| 🎯 Retriever | Batteries-included SOTA search — auto-selects BM25, embeddings (OpenAI/Cohere/HF), and reranker |
Installation
Quick Example
import rustfuzz.fuzz as fuzz
from rustfuzz.distance import Levenshtein, JaroWinkler
from rustfuzz import process
# Similarity ratios
fuzz.ratio("hello world", "hello wrold") # ~96.0
fuzz.partial_ratio("hello", "say hello world") # 100.0
fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy") # 100.0
# Edit distance
Levenshtein.distance("kitten", "sitting") # 3
JaroWinkler.similarity("martha", "marhta") # ~0.96
# Batch matching
process.extractOne("new york", ["New York", "Newark", "Los Angeles"])
# ('New York', 100.0, 0)
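To make clear what the edit-distance call above counts, here is a plain-Python reference sketch of unit-cost Levenshtein distance, plus the word-order normalisation that token-sort comparisons rely on. It is for illustration only and says nothing about how the Rust core is implemented; `levenshtein` and `token_sort_key` are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) using a two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute, or free match
                )
            )
        previous = current
    return previous[-1]


def token_sort_key(s: str) -> str:
    """Sort whitespace-separated tokens so that word order no longer matters."""
    return " ".join(sorted(s.split()))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(token_sort_key("wuzzy fuzzy") == token_sort_key("fuzzy wuzzy"))  # True
```

The two-row layout keeps memory proportional to the shorter string, which is the usual starting point before any bit-parallel or SIMD optimisation.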
3-Way Hybrid Search
from rustfuzz.search import Document, HybridSearch
docs = [
    Document("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
]
hs = HybridSearch(docs, embeddings=[[1, 0, 0], [0.9, 0.1, 0], [0.1, 0.9, 0]])
# Typo-tolerant + semantic search — all in Rust
results = hs.search("appel iphon", query_embedding=[1, 0, 0], n=1)
text, score, meta = results[0]
print(f"{text} — ${meta['price']}")
# Apple iPhone 15 Pro Max — $1199
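The example above combines three rankers via RRF (Reciprocal Rank Fusion). As a rough sketch of the general technique, each document receives `1 / (k + rank)` from every ranked list it appears in, and the contributions are summed. The function name, the toy rankings, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions for illustration; the source does not state which constants or tie-breaking rustfuzz uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings from three hypothetical retrievers (best match first).
bm25_rank = ["iphone", "pixel", "galaxy"]
fuzzy_rank = ["iphone", "galaxy", "pixel"]
dense_rank = ["galaxy", "iphone", "pixel"]

fused = reciprocal_rank_fusion([bm25_rank, fuzzy_rank, dense_rank])

if __name__ == "__main__":
    for doc_id, score in fused:
        print(f"{doc_id}: {score:.5f}")
```

Because RRF only looks at ranks, it needs no score normalisation across retrievers, which is why it is a popular choice for mixing lexical, fuzzy, and embedding-based results.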
Custom BM25 variants via fluent builder
Any of the BM25 variants (BM25L, BM25Plus, BM25T) can be turned into a HybridSearch model with the .to_hybrid() builder method:
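from rustfuzz.search import BM25L
# Reuses `docs` from the example above. `embeddings` holds one vector per
# document; here, the same toy vectors passed to HybridSearch earlier.
embeddings = [[1, 0, 0], [0.9, 0.1, 0], [0.1, 0.9, 0]]
results = (
    BM25L(docs, delta=0.5, b=0.8)
    .to_hybrid(embeddings=embeddings)
    .filter('brand = "Apple"')
    .match("iphone pro", n=10)
)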
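For readers unfamiliar with what parameters such as `b` and `delta` control, the following is a textbook-style sketch of BM25 scoring with an optional BM25+-style lower bound. It is not rustfuzz's implementation: the tokenizer (lowercased whitespace split), the IDF form `log((N + 1) / df)`, the default `k1 = 1.5`, and the function name `bm25_scores` are all assumptions chosen for illustration, and the exact formulas behind BM25L and BM25T are not shown in this document.

```python
import math
from collections import Counter


def bm25_scores(
    query: str,
    corpus: list[str],
    k1: float = 1.5,
    b: float = 0.75,
    delta: float = 0.0,
) -> list[float]:
    """Score every document in a non-empty corpus against the query.

    b     controls how strongly long documents are penalised (0 = not at all).
    delta adds a constant bonus per matched term (BM25+-style lower bound);
          delta = 0 gives classic BM25 term-frequency saturation.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / df[term])
            tf_part = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
            score += idf * (tf_part + delta)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "Apple iPhone 15 Pro Max",
        "Samsung Galaxy S24 Ultra",
        "Google Pixel 8 Pro",
    ]
    print(bm25_scores("iphone pro", corpus))
    print(bm25_scores("iphone pro", corpus, delta=1.0))
```

With `delta > 0`, a long document that matches a query term can never score lower on that term than `idf * delta`, which is the motivation behind the lower-bounded variants.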
Cookbook Recipes 🧑‍🍳
| Recipe | Description |
|---|---|
| Introduction | Get started — basic matching and terminology |
| Advanced Matching | Partial ratios, token sorts, score cutoffs |
| Benchmarks | Head-to-head speed comparisons vs RapidFuzz |
| Vector DB Hybrid Search | BM25 + dense embeddings with Qdrant, LanceDB, FAISS & more |
| LangChain Integration | Use rustfuzz as a LangChain Retriever |
| Real-World Examples | Entity resolution, deduplication & production patterns |
| Fuzzy Full Join | Multi-array fuzzy joins with MultiJoiner & RRF fusion |
| 3-Way Hybrid Search | BM25 + Fuzzy + Dense via RRF — Document & LangChain support |
| EmbedAnything | Rust-native embeddings — dense + sparse, no PyTorch needed |
| Retriever | Batteries-included SOTA search — auto-selects BM25, embeddings & reranker |
Start exploring from the navigation menu on the left!