The 1B Challenge¶
A story of memory management, algorithmic trade-offs, and the satisfying click of watching numbers improve.
At the core of rusket is the belief that frequent itemset mining should scale on a single machine. No Spark cluster, no distributed coordination — just one process, tuned to be as efficient as possible.
The 1 Billion Row Challenge is how we hold ourselves accountable.
Where We Started: Sorted Chunks + K-Way Merge¶
The first streaming design took inspiration from external-sort databases:
- Each call to `add_chunk()` receives `(txn_id, item_id)` arrays from Python.
- Rust sorts each chunk in-place using `rayon::par_sort_unstable()`.
- Chunks exceeding the RAM budget are spilled to anonymous `tempfile` files on disk.
- When `.mine()` is called, a k-way heap merge streams all sorted chunks in order, building the CSR matrix on the fly.
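The merge step can be sketched in a few lines of Python. This is illustrative only: the real implementation is a Rust binary heap over chunk cursors, but `heapq.merge` expresses the same idea.

```python
import heapq

# Each "spilled chunk" is a sorted list of (txn_id, item_id) pairs;
# mine() streams them back in global order via a k-way heap merge.
chunks = [
    sorted([(2, 7), (1, 3), (1, 5)]),
    sorted([(2, 1), (1, 3), (3, 9)]),
]

merged = list(heapq.merge(*chunks))
# Pairs now arrive grouped by txn_id, so a CSR matrix can be built
# in a single pass without holding all transactions in memory.
print(merged)  # [(1, 3), (1, 3), (1, 5), (2, 1), (2, 7), (3, 9)]
```

With two chunks this is cheap; the trouble described below starts when the heap has to juggle cursors into thousands of chunks at once.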
The first problem: disk exhaustion¶
Running the 500M → 1B targets, our machine hit 99% disk usage. The culprit: 500k transactions/chunk × 23 items/txn × 12 bytes/pair = ~138 MB per chunk. At 1B rows that's ~2,000 chunks = ~276 GB of tempfiles. Oops.
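The spill arithmetic above, reproduced directly from the post's figures (the ~2,000-chunk count is taken from the benchmark run, not derived here):

```python
# Back-of-the-envelope spill size for the sorted-chunk design.
txns_per_chunk = 500_000
items_per_txn = 23
bytes_per_pair = 12          # i64 txn_id + i32 item_id

chunk_bytes = txns_per_chunk * items_per_txn * bytes_per_pair
print(f"{chunk_bytes / 1e6:.0f} MB per chunk")                    # 138 MB

chunks_at_1b = 2_000         # chunk count observed at the 1B-row target
print(f"{chunks_at_1b * chunk_bytes / 1e9:.0f} GB of tempfiles")  # 276 GB
```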
The second problem: mine_t grows super-linearly¶
| target_rows | add_t | mine_t | total |
|---|---|---|---|
| 300M | 36.1s | 516.8s | 552.9s |
| 500M | 65.2s | 1543.5s | 1608.7s |
The heap with 1,000+ cursors causes cache thrashing — a fundamental problem with the architecture.
The Insight: HashMap Aggregation¶
The key observation: the sorted-chunk approach stores every pair from every chunk. The real data that matters is unique_transactions × avg_items_per_transaction. For 1B rows with ~43M unique transactions × 23 items: that's only ~5GB vs the sorted approach's ~12GB.
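The estimate works out as follows, using the post's figures (the HashMap number ignores `Vec` and hash-table overhead, so it is a lower bound):

```python
# Memory footprint: sorted-chunk design vs HashMap aggregation.
rows = 1_000_000_000         # total (txn_id, item_id) pairs
unique_txns = 43_000_000
items_per_txn = 23

# Sorted chunks: every pair stored as (i64, i32) = 12 bytes.
sorted_bytes = rows * 12

# HashMap: one i64 key plus a Vec<i32> of items per unique transaction.
hashmap_bytes = unique_txns * (8 + items_per_txn * 4)

print(f"{sorted_bytes / 1e9:.1f} GB vs {hashmap_bytes / 1e9:.1f} GB")
# 12.0 GB vs 4.3 GB
```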
We replaced the entire chunk + merge system with a single AHashMap<i64, Vec<i32>>:
pub fn add_chunk(&mut self, txn_ids: &[i64], item_ids: &[i32]) {
    for (&t, &i) in txn_ids.iter().zip(item_ids.iter()) {
        // Append each item to its transaction's bucket: no sorting, no spill.
        self.txns.entry(t).or_default().push(i);
    }
}
mine() now just:
- collects `(txn_id, &items)` pairs from the HashMap,
- `par_sort`s them by `txn_id`,
- sorts and dedups each transaction's item list,
- builds the CSR matrix and feeds it to the algorithm.
No k-way merge. No disk spill. No tempfiles.
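The whole aggregate-then-CSR path fits in a short Python sketch (a toy stand-in for the Rust code, with plain `sorted` in place of the parallel sort):

```python
import numpy as np

# Aggregate (txn_id, item_id) pairs into per-transaction buckets.
txn_ids = [10, 10, 42, 42, 42, 10]
item_ids = [3, 1, 2, 2, 5, 3]

txns: dict[int, list[int]] = {}
for t, i in zip(txn_ids, item_ids):
    txns.setdefault(t, []).append(i)

rows = sorted(txns.items())                            # sort by txn_id
rows = [(t, sorted(set(items))) for t, items in rows]  # sort + dedup per txn

# Build a CSR-style (indptr, indices) pair.
indptr = np.zeros(len(rows) + 1, dtype=np.int64)
indices: list[int] = []
for r, (t, items) in enumerate(rows):
    indices.extend(items)
    indptr[r + 1] = len(indices)
indices = np.array(indices, dtype=np.int32)

print(indptr.tolist(), indices.tolist())  # [0, 2, 4] [1, 3, 2, 5]
```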
Initial numbers (100M & 200M, FP-Growth)¶
| target_rows | add_t | mine_t | total |
|---|---|---|---|
| 100M | 5.7s | 26.6s | 32.3s |
| 200M | 10.4s | 84.6s | 95.0s |
Compared to the old approach's 300M taking 299.5s — a 3× speedup just from the architecture change.
What We Tried: The Iteration Phase¶
Approach: SmallVec + u16 items (❌ Regression)¶
Using `SmallVec<[u16; 32]>` to store items inline caused a 2× slowdown. The 32-element inline buffer bloats every map entry to well over 64 bytes, so each HashMap lookup drags extra cache lines along with it.
Lesson: measure first, optimize second.
Knob: Chunk size (100k vs 500k vs 2M)¶
| chunk | add_t | mine_t | total |
|---|---|---|---|
| 100k | 10.0s | 3.8s | 13.8s |
| 500k | 12.1s | 5.3s | 17.4s |
| 2M | 12.5s | 5.3s | 17.8s |
Chunk size has only a modest effect, with smaller chunks slightly ahead.
Knob: Algorithm (FP-Growth vs Eclat)¶
This was the biggest discovery. At 100M rows:
| method | add_t | mine_t | total | M rows/s |
|---|---|---|---|---|
| fpgrowth | 10.0s | ~55s | ~65s | 1.5 |
| eclat | 10.0s | 3.8s | 13.8s | 7.24 |
Eclat is ~14× faster at mining for dense retail data. The reason: Eclat works with vertical tidlists — for dense datasets with many frequent 2-itemsets, intersection operations are extremely cache-friendly.
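The vertical-tidlist idea can be shown with sets (a conceptual sketch only; rusket's Rust implementation uses sorted integer arrays, but support counting by intersection is the same):

```python
from functools import reduce

# Toy horizontal data: one set of items per transaction.
transactions = [
    {0, 1, 2},
    {0, 2},
    {1, 2},
    {0, 1, 2},
]

# Invert to the vertical layout: item -> tidlist (transaction ids).
tidlists: dict[int, set[int]] = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    """Support = size of the intersection of the members' tidlists."""
    return len(reduce(set.intersection, (tidlists[i] for i in itemset)))

print(support((0, 2)))     # transactions containing both items 0 and 2 -> 3
print(support((0, 1, 2)))  # -> 2
```

Each candidate extension is just one more intersection over contiguous data, which is why dense datasets with many frequent 2-itemsets suit Eclat so well.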
Final Results: The Road to 1B¶
Dense retail data (avg 23 items/txn)¶
| target_rows | add_t | mine_t | total | M rows/s | itemsets |
|---|---|---|---|---|---|
| 100M | 12.5s | 6.2s | 18.7s | 5.35 | 15,218 |
| 200M | 22.7s | 13.4s | 36.1s | 5.53 | 15,226 |
| 500M | 61.1s | 38.7s | 99.8s | 5.01 | 15,234 |
| 1B | 173.5s | 208.5s | 382.1s | 2.62 | 15,233 |
✅ 1 Billion rows. 382 seconds. 15,233 frequent itemsets. No OOM. No disk spill.
Sparse catalogue data (avg 4.4 items/txn)¶
| target_rows | add_t | mine_t | total |
|---|---|---|---|
| 100M | 20.2s | 4.9s | 25.1s |
| 200M | 26.2s | 9.2s | 35.4s |
| 500M | ❌ | ❌ | OOM |
| 1B | ❌ | ❌ | OOM |
The 1B challenge for sparse datasets remains open.
What We Learned¶
Architecture beats micro-optimisation¶
The jump from k-way merge → HashMap changed 5-minute runs into 30-second runs.
Eclat vs FP-Growth depends on density¶
| data density | winner | why |
|---|---|---|
| Dense (avg >10 items/txn) | Eclat | Tidlist intersections are O(n), very cache-friendly |
| Sparse (avg <5 items/txn) | FP-Growth or similar | Fewer candidates, conditional bases are small |
Running It Yourself¶
from rusket import FPMiner
import numpy as np

miner = FPMiner(n_items=your_n_items)
for txn_ids_chunk, item_ids_chunk in your_data_stream():
    miner.add_chunk(
        txn_ids_chunk.astype(np.int64),
        item_ids_chunk.astype(np.int32),
    )

freq = miner.mine(min_support=0.02, max_len=3, method="eclat")
print(f"Found {len(freq):,} frequent itemsets")
The full benchmark script is in benchmarks/bench_fpminer_realistic.py.