Fuzzy Full Join

When to use this

Use fuzzy_join / MultiJoiner when you have two or more arrays of records that you want to link by similarity — without a shared key. Classic use cases: product matching, entity resolution, record linkage, cross-catalogue deduplication.

rustfuzz provides a Rust-native, Rayon-parallel fuzzy join across any number of named arrays. Supports full-join and inner-join modes, pairwise and wide (pivoted) output formats.

Three complementary signals are supported and fused via Reciprocal Rank Fusion (RRF):

Channel	Input	Scoring	Auto-fallback
Text	`texts=`	BM25 Okapi + indel fuzzy, RRF	—
Sparse	`sparse=`	dot product on `{token_id: weight}`	✅ BM25 when `sparse=None`
Dense	`dense=`	cosine similarity on unit vectors	—

Quick Start

from rustfuzz.join import fuzzy_join

products  = ["Apple iPhone 14 Pro", "Samsung Galaxy S23 Ultra", "Google Pixel 7a"]
listings  = ["iphone 14 pro max",   "galaxy s23 ultra 256gb",   "pixel 7a black"]
inventory = ["Apple Iphone",        "Samsung Galxy",            "Gogle Pixle"]   # typos

rows = fuzzy_join(
    {"products": products, "listings": listings, "inventory": inventory},
    n=1,   # top-1 match per element per target array
)

for r in rows:
    print(f"{r['src_array']}[{r['src_idx']}] → {r['tgt_array']}[{r['tgt_idx']}]  "
          f"{r['src_text']!r:30} ↔ {r['tgt_text']!r}  score={r['score']:.4f}")

Output (excerpt):

products[0] → listings[0]    'Apple iPhone 14 Pro'          ↔ 'iphone 14 pro max'
products[1] → inventory[1]   'Samsung Galaxy S23 Ultra'     ↔ 'Samsung Galxy'
inventory[0] → products[0]   'Apple Iphone'                 ↔ 'Apple iPhone 14 Pro'

How Scoring Works

Reciprocal Rank Fusion (RRF)

Each active channel produces a ranking over all target documents. The final score per target is:

score = Σ  weight_channel × (1 / (rrf_k + rank_channel))
        ────────────────────────────────────────────────
                       Σ active_weights

where rrf_k = 60 (default, Cormack et al. 2009). This makes the score robust to scale differences between channels — BM25 values and cosine similarities are never directly compared, only their ranks are.

Text channel

A BM25 Okapi index is built over the target array texts once per array pair. Then for each source query:

Score all target docs with BM25 → rank by score
Score all target docs with indel fuzzy similarity → rank by score
RRF-fuse the two ranking lists

This dual-ranking combination means misspellings and abbreviations are handled robustly: if BM25 misses a misspelled term, the fuzzy rerank salvages it.

Sparse channel (auto-fallback)

When you provide explicit sparse= dict vectors (e.g. SPLADE, BM25S encodings), a merge-join dot product is computed. When no sparse vectors are provided, the channel automatically falls back to pure BM25 scores over the target text — giving you a complementary term-overlap signal at zero extra cost.

sparse provided?	text provided?	Sparse channel behaviour
✅	any	merge-join dot product on your vectors
❌	✅	BM25 score over target texts (auto-fallback)
❌	❌	channel disabled

Dense channel

Expects pre-normalised embedding vectors (unit norm). Scoring is pure dot product (= cosine similarity on unit vectors) computed in Rayon-parallel across all target docs.

API Reference

`fuzzy_join` — one-shot convenience

from rustfuzz.join import fuzzy_join

rows = fuzzy_join(
    arrays: dict[str, list[str]],           # named text arrays
    *,
    sparse: dict[str, list[dict[int, float]]] | None = None,  # named sparse arrays
    dense:  dict[str, list[list[float]]]    | None = None,    # named dense arrays
    text_weight:   float = 1.0,
    sparse_weight: float = 1.0,
    dense_weight:  float = 1.0,
    n:      int   = 1,      # top-N matches per element
    bm25_k1: float = 1.5,
    bm25_b:  float = 0.75,
    rrf_k:   int   = 60,
) -> list[dict]

All keys in sparse / dense must appear in arrays.

New parameters:

Parameter	Default	Description
`how`	`"full"`	`"full"` — return all top-N rows. `"inner"` — only rows with `score >= score_cutoff`.
`score_cutoff`	`None`	Minimum score for inner join mode (defaults to `0.0` when `how="inner"`).

`MultiJoiner` — fine-grained control

from rustfuzz.join import MultiJoiner

joiner = MultiJoiner(
    text_weight=1.0,
    sparse_weight=1.0,
    dense_weight=1.0,
    bm25_k1=1.5,
    bm25_b=0.75,
    rrf_k=60,
)

# Method-chaining API
joiner = (
    MultiJoiner()
    .add_array("A", texts=[...], dense=[...])
    .add_array("B", texts=[...], sparse=[...])
)

rows = joiner.join(n=1)                  # full join — all ordered pairs
rows = joiner.join(n=1, how="inner",     # inner join — only good matches
                   score_cutoff=0.01)
rows = joiner.join_pair("A", "B", n=3)  # single direction, top-3
rows = joiner.join_wide("A", n=1)       # pivoted — one row per A element

add_array parameters:

Parameter	Type	Description
`name`	`str`	Unique array identifier
`texts`	`list[str \\| None] \\| None`	Text per element; drives text + sparse-fallback channels
`sparse`	`list[dict[int, float] \\| None] \\| None`	Sparse vectors as `{token_id: weight}`
`dense`	`list[list[float] \\| None] \\| None`	Dense embedding vectors (pre-normalised)

At least one of texts, sparse, or dense must be non-None.

Return schema

# Pairwise long format (join / join_pair)
{
    "src_array":    str,           # source array name
    "src_idx":      int,           # index within source array
    "src_text":     str | None,    # source element text
    "tgt_array":    str,           # target array name
    "tgt_idx":      int,           # index within target array
    "tgt_text":     str | None,    # matched target text
    "score":        float,         # combined weighted RRF score
    "text_score":   float | None,  # raw text-channel RRF (or None if disabled)
    "sparse_score": float | None,  # dot-product or BM25 score (or None if disabled)
    "dense_score":  float | None,  # cosine similarity (or None if disabled)
}

# Wide/pivoted format (join_wide) — one row per source element
{
    "src_array": str,
    "src_idx":   int,
    "src_text":  str | None,
    # One pair of columns per target array X:
    "match_X":  str | list[str] | None,    # matched text (or list when n>1)
    "score_X":  float | list[float] | None, # score (or list when n>1)
}

Examples

Text-only: product matching with typos

from rustfuzz.join import fuzzy_join

products  = ["Apple iPhone 14 Pro", "Samsung Galaxy S23 Ultra", "Google Pixel 7a"]
inventory = ["Apple Iphone",        "Samsung Galxy",            "Gogle Pixle"]

rows = fuzzy_join({"products": products, "inventory": inventory}, n=1)
# inventory[0] = "Apple Iphone"  →  products[0] = "Apple iPhone 14 Pro"  ✓
# inventory[1] = "Samsung Galxy" →  products[1] = "Samsung Galaxy S23 Ultra"  ✓

Messy Company Names (Real-World Data)

When resolving entities across different databases, company names are often noisy, abbreviated, or missing legal suffixes. This is where the combination of BM25 (word overlap) and indel fuzzy (typo tolerance) shines:

import pandas as pd
from rustfuzz.join import fuzzy_join

# Our internal CRM database
crm_records = [
    "Acme Corp LLC",
    "Global Logistics International",
    "Tech Solutions Inc.",
    "Smith & Sons Hardware",
]

# External billing list with noisy data from user input
billing_list = [
    "ACME Corporation",           # expanded suffix, caps
    "Global Logistic Int.",       # missing 's', abbreviated suffix
    "TechSolution",               # missing space, missing suffix
    "Smith and Sons Hardware Co", # 'and' instead of '&', extra suffix
    "Unknown Entity",             # no match
]

rows = fuzzy_join(
    {"crm": crm_records, "billing": billing_list},
    n=1,
    how="inner",
    score_cutoff=0.015, # drop confident mismatches like 'Unknown Entity'
)

df = pd.DataFrame(rows)[["src_text", "tgt_text", "score"]]
df.columns = ["CRM Name", "Matched Billing Name", "Score"]
print(df)

Output:

                         CRM Name        Matched Billing Name     Score
0                   Acme Corp LLC            ACME Corporation  0.033333
1  Global Logistics International        Global Logistic Int.  0.032787
2             Tech Solutions Inc.                TechSolution  0.027778
3           Smith & Sons Hardware  Smith and Sons Hardware Co  0.030769

Dense-only: semantic embedding matching

import math
from rustfuzz.join import MultiJoiner

def normalize(v):
    m = math.sqrt(sum(x*x for x in v))
    return [x/m for x in v]

emb_a = [normalize([1,0,0]), normalize([0,1,0]), normalize([0,0,1])]
emb_b = [normalize([0.97,0.08,0]), normalize([0.03,0.98,0.05]), normalize([0.01,0.04,0.99])]

joiner = (
    MultiJoiner(text_weight=0, sparse_weight=0, dense_weight=1)
    .add_array("A", texts=["concept_X","concept_Y","concept_Z"], dense=emb_a)
    .add_array("B", texts=["near_X",   "near_Y",   "near_Z"],   dense=emb_b)
)

for r in joiner.join_pair("A", "B", n=1):
    print(f"{r['src_text']} → {r['tgt_text']}  cosine={r['dense_score']:.4f}")
# concept_X → near_X   cosine=0.9979
# concept_Y → near_Y   cosine=0.9988
# concept_Z → near_Z   cosine=0.9990

Text + Dense: fusing lexical and semantic signals

from rustfuzz.join import MultiJoiner

brands = ["Apple MacBook Pro", "Dell XPS 15",  "Lenovo ThinkPad X1"]
descr  = ["macbook pro m3",    "xps 15 oled",  "thinkpad x1 carbon"]

# Dummy normalized embeddings — use your real model in production
emb_brands = [[1,0,0], [0,1,0], [0,0,1]]
emb_descr  = [[0.97,0.05,0], [0.02,0.96,0.04], [0.01,0.03,0.98]]

joiner = (
    MultiJoiner(text_weight=0.4, sparse_weight=0.4, dense_weight=0.6)
    .add_array("brands", texts=brands, dense=emb_brands)
    .add_array("descr",  texts=descr,  dense=emb_descr)
)

for r in joiner.join_pair("brands", "descr", n=1):
    print(f"{r['src_text']:<22} → {r['tgt_text']:<22}  "
          f"score={r['score']:.4f}  bm25={r['sparse_score']:.2f}  cos={r['dense_score']:.4f}")

Explicit sparse vectors (SPLADE / BM25S)

Pass pre-encoded sparse vectors directly — the merge-join dot product is much faster than building a BM25 index for very large pre-encoded corpora:

from rustfuzz.join import MultiJoiner

# Token IDs from your vocabulary; weights are TF-IDF / SPLADE scores
sparse_corpus = [
    {101: 0.9, 2345: 0.7, 879: 0.4},  # doc 0
    {202: 0.85, 3310: 0.6},            # doc 1
]
sparse_query = [
    {101: 0.95, 879: 0.5},             # query 0
    {202: 0.9,  3310: 0.7},            # query 1
]

joiner = (
    MultiJoiner(text_weight=0, sparse_weight=1, dense_weight=0)
    .add_array("queries", sparse=sparse_query)
    .add_array("corpus",  sparse=sparse_corpus)
)

for r in joiner.join_pair("queries", "corpus", n=1):
    print(f"query[{r['src_idx']}] → corpus[{r['tgt_idx']}]  dot={r['sparse_score']:.4f}")

Pandas / Polars output

import pandas as pd
from rustfuzz.join import fuzzy_join

rows = fuzzy_join({"A": array_a, "B": array_b, "C": array_c}, n=1)
df = pd.DataFrame(rows)

# Best match per source element
best = df.loc[df.groupby(["src_array","src_idx"])["score"].idxmax()]
print(best[["src_array","src_text","tgt_array","tgt_text","score"]])

import polars as pl
from rustfuzz.join import fuzzy_join

df = pl.DataFrame(fuzzy_join({"A": array_a, "B": array_b}, n=3))

# Top match per pair
top1 = (
    df.sort("score", descending=True)
      .group_by(["src_array","src_idx","tgt_array"])
      .first()
)

Join Types

Full join (default)

Always returns the top-N target matches for every source element. Equivalent to a filtered cross join.

rows = fuzzy_join({"A": array_a, "B": array_b}, n=3, how="full")
# 3 elements × 2 directions × n=3 → 18 rows, all returned

Inner join

Only keeps rows where the best match is confident enough.

rows = fuzzy_join(
    {"A": array_a, "B": array_b},
    n=1,
    how="inner",
    score_cutoff=0.015,   # tune based on your data
)
# Only elements with a real match survive

Choosing score_cutoff

RRF scores depend on rrf_k and the number of active channels. A practical starting point for 2 channels with rrf_k=60 is 0.010–0.020. Inspect pd.DataFrame(rows)["score"].describe() to set a data-driven threshold.

N-Array Joins and Wide Format

MultiJoiner supports any number of arrays — add as many as you need.

Pairwise long format (default)

from rustfuzz.join import fuzzy_join

# 5 catalogues — produces 5×4 = 20 ordered pairs
arrays = {
    "source":  ["Apple iPhone", "Galaxy S23"],
    "shopA":   ["iphone 14",    "galaxy s23 ultra"],
    "shopB":   ["Apple Iphone", "Samsung Galxy"],
    "shopC":   ["iPhone Pro",   "GalaxyS23"],
    "shopD":   ["iPhne",        "galaxys 23"],
}
rows = fuzzy_join(arrays, n=1)  # 2 elements × 20 pairs = 40 rows

Wide / pivoted format (`join_wide`)

Returns one row per source element with match columns for every other array — ideal when you want a single output table:

from rustfuzz.join import MultiJoiner

catalogues = ["Apple iPhone 14", "Samsung Galaxy S23", "Google Pixel 7"]
shop_a     = ["iphone 14 pro",   "galaxy s23 ultra",   "pixel 7a"]
shop_b     = ["Apple Iphone",    "Samsung Galxy",       "Google Pixal"]
inventory  = ["Apple Inc phone", "Samsung S23",         "Pixel phone 7"]

joiner = (
    MultiJoiner()
    .add_array("catalogues", texts=catalogues)
    .add_array("shop_a",     texts=shop_a)
    .add_array("shop_b",     texts=shop_b)
    .add_array("inventory",  texts=inventory)
)

# One row per catalogue entry, with best match in each other array
rows = joiner.join_wide("catalogues", n=1)
for r in rows:
    print(f"{r['src_text']:<22}  "
          f"shop_a={r['match_shop_a']!r:<20}  "
          f"shop_b={r['match_shop_b']!r:<20}  "
          f"inventory={r['match_inventory']!r}")