Cookbook

A hands-on guide to every feature in rusket — from market basket analysis to billion-scale collaborative filtering.


Setup

pip install rusket

import numpy as np
import pandas as pd
import polars as pl
from rusket import FPGrowth, Eclat, association_rules
from rusket import ALS, eALS, BPR, PrefixSpan, HUPM, Recommender

1. Market Basket Analysis — Grocery Retail

Business context

A supermarket chain wants to identify which product combinations appear most frequently in customer baskets. The output drives:

  • "Frequently Bought Together" widgets on the self-checkout screen
  • Shelf adjacency decisions (place high-lift pairs closer together)
  • Promotional bundles (discount pairs with high confidence but low current margin)

Prepare the basket data and find frequent combinations

import numpy as np
import pandas as pd
from rusket import FPGrowth

np.random.seed(42)

categories = {
    "Milk": 0.55, "Bread": 0.52, "Butter": 0.36, "Eggs": 0.41,
    "Cheese": 0.28, "Yogurt": 0.22, "Coffee": 0.31, "Tea": 0.18,
    "Sugar": 0.20, "Apples": 0.25, "Bananas": 0.30, "Oranges": 0.15,
    "Chicken": 0.35, "Pasta": 0.27, "Tomato Sauce": 0.26, "Onions": 0.40,
}

n_receipts = 10_000
df_long = pd.DataFrame(
    [(receipt, product)
     for receipt in range(n_receipts)
     for product, prob in categories.items()
     if np.random.rand() < prob],
    columns=["receipt_id", "product"],
)

miner = FPGrowth.from_transactions(
    df_long,
    transaction_col="receipt_id",
    item_col="product",
    min_support=0.05,
    use_colnames=True,
)
freq = miner.mine()
print(f"Found {len(freq):,} frequent product combinations")
top_combos = freq.sort_values("support", ascending=False)

Generate cross-sell rules

# Rules are now accessible directly from the miner instance
rules = miner.association_rules(min_threshold=0.3)
actionable = rules[(rules["confidence"] > 0.45) & (rules["lift"] > 1.2)]
print(actionable.sort_values("lift", ascending=False).head(10))
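
For reference, the support, confidence and lift columns filtered on above follow the standard association-rule definitions. The sketch below computes them by hand in plain Python on a made-up five-basket dataset; the baskets and the helper name `support` are illustrative and not part of rusket.

```python
# Hypothetical toy data: five baskets, unrelated to the simulated receipts above.
baskets = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule under test: Milk -> Bread
sup_milk = support({"Milk"})            # 4 of 5 baskets
sup_bread = support({"Bread"})          # 4 of 5 baskets
sup_both = support({"Milk", "Bread"})   # 3 of 5 baskets

confidence = sup_both / sup_milk        # estimate of P(Bread | Milk)
lift = confidence / sup_bread           # > 1 means bought together more than chance

print(f"support={sup_both:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```

In this toy data the lift is below 1, so the rule would be dropped by the `lift > 1.2` filter even though its confidence is high.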

Limit itemset length for large catalogues

miner_pairs = FPGrowth.from_transactions(
    df_long,
    transaction_col="receipt_id",
    item_col="product",
    min_support=0.02,
    max_len=2,
    use_colnames=True,
)
freq_pairs = miner_pairs.mine()

2. ECLAT — When to Use vs FPGrowth

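ECLAT uses a vertical bitset representation: each item maps to the set of transactions that contain it, and the support of an itemset is obtained by intersecting those sets. It is typically faster than FPGrowth on sparse datasets with many items and a low support threshold.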

from rusket import Eclat

freq_ec = Eclat.from_transactions(
    df_long,
    transaction_col="receipt_id",
    item_col="product",
    min_support=0.05,
    use_colnames=True,
).mine()
| Condition | Recommended class |
|---|---|
| Dense dataset, few items | FPGrowth |
| Sparse dataset, many items, low support | Eclat |
| Very large dataset (100M+ rows) | FPMiner with streaming |
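
To make the vertical representation concrete, here is a minimal pure-Python sketch that stores one bitset per item and counts support by AND-ing bitsets. It only illustrates the data layout on hypothetical toy transactions; it enumerates pairs directly rather than running the full ECLAT search, and it says nothing about how rusket implements this internally.

```python
from itertools import combinations

# Hypothetical toy transactions; the list index is the transaction id.
transactions = [
    ["Milk", "Bread"],
    ["Milk", "Eggs", "Butter"],
    ["Bread", "Eggs"],
    ["Milk", "Bread", "Eggs"],
]

# Vertical layout: one integer bitset per item, bit t set if transaction t has it.
bitsets = {}
for tid, items in enumerate(transactions):
    for item in items:
        bitsets[item] = bitsets.get(item, 0) | (1 << tid)

def support_count(itemset):
    """Support of an itemset = number of set bits in the AND of its items' bitsets."""
    mask = (1 << len(transactions)) - 1
    for item in itemset:
        mask &= bitsets[item]
    return bin(mask).count("1")

min_count = 2
frequent_pairs = {}
for pair in combinations(sorted(bitsets), 2):
    count = support_count(pair)
    if count >= min_count:
        frequent_pairs[pair] = count

print(frequent_pairs)
```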

3. Transaction Input Formats

From a Pandas DataFrame

import pandas as pd
from rusket import FPGrowth

orders = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3],
    "item":     ["Milk", "Bread", "Eggs", "Milk", "Butter", "Eggs"],
})

freq = FPGrowth.from_transactions(
    orders,
    transaction_col="order_id",
    item_col="item",
    min_support=0.3,
    use_colnames=True,
).mine()

From a Polars DataFrame

import polars as pl
from rusket import FPGrowth

orders_pl = pl.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3],
    "item":     ["Milk", "Bread", "Eggs", "Milk", "Butter", "Eggs"],
})

freq = FPGrowth.from_transactions(
    orders_pl,
    transaction_col="order_id",
    item_col="item",
    min_support=0.3,
    use_colnames=True,
).mine()

From a list of lists

from rusket import FPGrowth

baskets = [["Milk", "Bread"], ["Milk", "Eggs", "Butter"], ["Bread", "Eggs"]]
freq = FPGrowth(baskets, min_support=0.5, use_colnames=True).mine()

4. Collaborative Filtering with ALS

Fit from purchase history

import pandas as pd
from rusket import ALS

purchases = pd.DataFrame({
    "customer_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1003],
    "sku":         ["A10", "B22", "C15",  "A10", "D33",  "B22", "C15", "E07"],
    "revenue":     [29.99, 49.00, 9.99,  29.99, 15.00, 49.00, 9.99, 22.00],
})

model = ALS.from_transactions(
    purchases,
    transaction_col="customer_id",
    item_col="sku",
    rating_col="revenue",
    factors=64,
    iterations=15,
    alpha=40.0,
    cg_iters=3,
).fit()

Get personalised recommendations

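# Top 5 SKUs for customer 1002, skipping SKUs they have already bought
skus, scores = model.recommend_items(user_id=1002, n=5, exclude_seen=True)
# Top 100 customers for SKU "B22"
top_customers, customer_scores = model.recommend_users(item_id="B22", n=100)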

Access latent factors (item embeddings) directly

# NumPy arrays (n_users x factors) and (n_items x factors)
print(model.user_factors.shape)  # (n_users, 64)
print(model.item_factors.shape)  # (n_items, 64)

# Semantic alias for LLM/GenAI workflows
embeddings = model.item_embeddings

4b. Element-wise ALS (eALS) — Faster Default

eALS is a convenience wrapper that sets use_eals=True by default. It updates factors element-by-element and is typically faster and more memory-efficient than the standard CG solver.

from rusket import eALS

model = eALS.from_transactions(
    purchases,
    transaction_col="customer_id",
    item_col="sku",
    rating_col="revenue",
    factors=64,
    iterations=15,
    alpha=40.0,
).fit()

# Exact same API as ALS — all methods work identically
skus, scores = model.recommend_items(user_id=1002, n=5)

Tip

eALS and ALS(use_eals=True) produce identical results. Use eALS for a cleaner API.

5. Out-of-Core ALS for 1B+ Ratings

Build the out-of-core CSR matrix

import numpy as np
from scipy import sparse
from pathlib import Path

data_dir = Path("data/ml-1b/ml-20mx16x32")
npz_files = sorted(data_dir.glob("trainx*.npz"))

# ... (out of core logic) ...

Fit ALS on the out-of-core matrix

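from scipy import sparse
from rusket import ALS

# n_users, n_items, indptr, mmap_indices and mmap_data come from the
# out-of-core construction step elided above.
mat = sparse.csr_matrix((n_users, n_items))
mat.indptr  = indptr
mat.indices = mmap_indices
mat.data    = mmap_data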

model = ALS(factors=64, iterations=5, alpha=40.0, verbose=True, cg_iters=3)
model.fit(mat)

Tip

On a machine with ≥ 32 GB RAM, each iteration completes in ~5 minutes. On 8 GB RAM, each iteration is disk-bound and takes hours.


6. Bayesian Personalized Ranking (BPR)

import numpy as np
import pandas as pd
from rusket import BPR

purchases = pd.DataFrame({
    "user_id": np.random.randint(0, 1000, size=5000),
    "item_id": np.random.randint(0, 500, size=5000),
})

model = BPR.from_transactions(
    purchases,
    transaction_col="user_id",
    item_col="item_id",
    factors=64,
    learning_rate=0.01,
    regularization=0.01,
    iterations=100,
    seed=42,
).fit()

items, scores = model.recommend_items(user_id=10, n=5)
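
As background on what BPR optimises: for each (user, seen item, unseen item) triple, the standard BPR objective penalises the model when the unseen item scores higher than the seen one. The sketch below evaluates that pairwise loss in plain Python; the vectors are made up, regularisation is omitted, and it is not a description of rusket's training loop.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """-log(sigmoid(x_ui - x_uj)) for one (user, seen item, unseen item) triple."""
    margin = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    return math.log1p(math.exp(-margin))

# Hypothetical 2-factor vectors.
user = [0.5, 1.0]
seen_item = [1.0, 1.0]
unseen_item = [0.0, 0.5]

well_ranked = bpr_loss(user, seen_item, unseen_item)   # seen item scores higher
mis_ranked = bpr_loss(user, unseen_item, seen_item)    # roles swapped

print(f"{well_ranked:.4f} < {mis_ranked:.4f}")
```

Gradient steps on this loss, scaled by learning_rate and damped by regularization, push seen items above unseen ones for each user.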

7. Sequential Pattern Mining (PrefixSpan)

import pandas as pd
from rusket import PrefixSpan

clickstream = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
    "timestamp":  [10, 20, 30, 15, 25, 5, 15, 35, 10, 18, 40],
    "page": [
        "Home", "Pricing", "Checkout",
        "Home", "Pricing",
        "Features", "Pricing", "Checkout",
        "Home", "Features", "Checkout",
    ],
})

miner = PrefixSpan.from_transactions(
    clickstream,
    user_col="session_id",
    time_col="timestamp",
    item_col="page",
    min_support=2,
    max_len=4,
)
patterns_df = miner.mine()
print(patterns_df.head(10))
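
To show what the support of a sequential pattern means, the sketch below counts it by hand for the clickstream above. It assumes that min_support=2 is an absolute number of sessions and that gaps between pattern elements are allowed; both are assumptions about PrefixSpan semantics in general, not confirmed rusket behaviour.

```python
# Sessions from the clickstream above, already ordered by timestamp.
sessions = {
    1: ["Home", "Pricing", "Checkout"],
    2: ["Home", "Pricing"],
    3: ["Features", "Pricing", "Checkout"],
    4: ["Home", "Features", "Checkout"],
}

def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `page in remaining` advances the iterator, so order is enforced.
    return all(page in remaining for page in pattern)

def pattern_support(pattern):
    """Number of sessions that contain `pattern` as a subsequence."""
    return sum(contains_in_order(seq, pattern) for seq in sessions.values())

print(pattern_support(["Home", "Pricing"]))              # sessions 1 and 2
print(pattern_support(["Home", "Checkout"]))             # sessions 1 and 4
print(pattern_support(["Home", "Pricing", "Checkout"]))  # session 1 only
```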

8. High-Utility Pattern Mining (HUPM)

import pandas as pd
from rusket import HUPM

receipts = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    "product":    ["champagne", "foie_gras", "truffle_oil",
                   "champagne", "truffle_oil",
                   "foie_gras", "truffle_oil",
                   "champagne", "foie_gras", "truffle_oil"],
    "margin":     [18.50, 14.00, 8.00, 18.50, 8.00, 14.00, 8.00, 18.50, 14.00, 8.00],
})

high_value = HUPM.from_transactions(
    receipts,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=30.0,
).mine()
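
For intuition, the utility of an itemset is usually defined as the summed utility of its items over every transaction that contains the whole itemset. The sketch below applies that definition to the receipts above in plain Python; whether rusket compares min_utility against exactly this total is an assumption.

```python
# Same receipts as above, reshaped to {receipt_id: {product: margin}}.
receipts = {
    1: {"champagne": 18.50, "foie_gras": 14.00, "truffle_oil": 8.00},
    2: {"champagne": 18.50, "truffle_oil": 8.00},
    3: {"foie_gras": 14.00, "truffle_oil": 8.00},
    4: {"champagne": 18.50, "foie_gras": 14.00, "truffle_oil": 8.00},
}

def itemset_utility(itemset):
    """Sum the margins of `itemset` over every receipt that contains all of it."""
    return sum(
        sum(lines[item] for item in itemset)
        for lines in receipts.values()
        if all(item in lines for item in itemset)
    )

print(itemset_utility(["champagne", "truffle_oil"]))               # receipts 1, 2, 4
print(itemset_utility(["champagne", "foie_gras"]))                 # receipts 1, 4
print(itemset_utility(["champagne", "foie_gras", "truffle_oil"]))  # receipts 1, 4
```

Note that the three-item set has the highest total margin even though it appears in only two receipts, which is the kind of pattern a pure frequency threshold can miss.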

9. Native Polars Integration

All miners accept Polars DataFrames directly — no conversion needed:

import polars as pl
from rusket import FPGrowth

df_pl = pl.read_parquet("transactions.parquet")

freq = FPGrowth.from_transactions(
    df_pl,
    transaction_col="order_id",
    item_col="product_id",
    min_support=0.05,
    use_colnames=True,
).mine()

10. Spark / Databricks Integration

Streaming 1B+ Rows from Spark

from rusket import FPMiner

spark_df = spark.table("silver_transactions")
frequent_itemsets = FPMiner(
    spark_df,
    n_items=500_000,
    txn_col="transaction_id",
    item_col="product_id",
    min_support=0.001,
).mine()

Distributed Parallel Mining (Grouped)

from rusket.spark import mine_grouped

regional_rules_df = mine_grouped(spark_df, group_col="store_id", min_support=0.05)

Collaborative Filtering (ALS) from Spark

from rusket import ALS

model = ALS.from_transactions(
    spark.table("implicit_ratings"),
    transaction_col="user_id",
    item_col="item_id",
    rating_col="clicks",
    factors=64,
    iterations=10,
).fit()

11. Databricks: High-Speed Cross-Sell Generation

When scoring millions of users in Databricks, a Python for loop over users becomes the bottleneck. batch_recommend scores all users in parallel using Rust's parallel iterators (Rayon) and returns the result as a native Spark or Polars DataFrame.

from rusket import ALS
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
purchases = spark.table("bronze_layer.customer_transactions")

# 1. Train the model (toPandas() collects the table to the driver)
als = ALS.from_transactions(
    purchases.toPandas(), # Or pass Polars directly if memory allows
    transaction_col="customer_id",
    item_col="product_id",
    rating_col="sales_amount",
    factors=128,
    iterations=15,
).fit()

# 2. Score ALL users simultaneously across all CPU cores (Rust Rayon)
#    Returns a fast Polars DataFrame: [user_id, item_id, score]
recommendations_pl = als.batch_recommend(n=10, format="polars")

# 3. Export L2-normalized item and user factors directly to Spark for Delta tables
user_factors_df = als.export_user_factors(normalize=True, format="spark")
item_factors_df = als.export_factors(normalize=True, format="spark")

# 4. Save to Delta
user_factors_df.write.format("delta").mode("overwrite").saveAsTable("silver_layer.user_embeddings")
item_factors_df.write.format("delta").mode("overwrite").saveAsTable("silver_layer.item_embeddings")

12. Tuning Guide

FPGrowth / Eclat

| Parameter | Default | Effect |
|---|---|---|
| min_support | required | Lower → more itemsets, slower |
| max_len | None | Caps itemset size; huge speedup on large catalogs |
| use_colnames | False | Return column names instead of indices |

ALS

| Parameter | Default | Notes |
|---|---|---|
| factors | 64 | Higher → better quality, more RAM, slower |
| iterations | 15 | 5–15 is typical |
| alpha | 40.0 | Higher → stronger signal |
| cg_iters | 3 | CG solver steps |
| use_eals | False | Use eALS solver (faster, less memory) |
| eals_iters | 1 | Inner iterations for eALS |
| anderson_m | 0 | Anderson acceleration history (5 recommended) |

13. Item Similarity and Cross-Selling Potential

# Now part of the Model class
item_ids, match_scores = model.similar_items(item_id=102, n=5)
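
One common way to compute item similarity from latent factors is cosine similarity between item vectors. The sketch below shows that idea in plain Python on made-up 3-dimensional factors; the helper `most_similar` is hypothetical, and the choice of cosine similarity is an assumption rather than a statement about what similar_items does internally.

```python
import math

# Hypothetical 3-dimensional item factors keyed by item id.
item_factors = {
    101: [0.9, 0.1, 0.0],
    102: [1.0, 0.0, 0.1],
    103: [0.0, 1.0, 0.2],
    104: [0.8, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def most_similar(item_id, n):
    """Return the n (item_id, similarity) pairs closest to `item_id`."""
    query = item_factors[item_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in item_factors.items()
        if other != item_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

for other, sim in most_similar(102, n=2):
    print(other, round(sim, 3))
```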

14. Hybrid Recommender (ALS + Association Rules)

from rusket import Recommender

rec = Recommender(model=model, rules_df=rules)
item_ids, scores = rec.recommend_for_user(user_id=125, n=5)
suggested_additions = rec.recommend_for_cart([10, 15], n=3)

Advanced Pipeline with Business Rules

You can also use a Pipeline to inject curated item associations. Items from a RuleBasedRecommender bypass CF reranking entirely and receive a very large score boost (e.g., +1,000,000) so that they rank first.

import pandas as pd
from rusket import ALS, Pipeline, RuleBasedRecommender

als = ALS(factors=64).fit(interactions)

# "When buying headphones (102), always push the warranty (999)"
rules_df = pd.DataFrame({
    "antecedent": [102],
    "consequent": [999],
    "score": [2.0]
})
rules = RuleBasedRecommender.from_transactions(
    interactions, rules=rules_df, user_col="user_id", item_col="item_id"
).fit()

pipeline = Pipeline(
    retrieve=[als],
    rules=rules, 
)

# If the user previously interacted with item 102, 
# item 999 will rank #1 globally.
recs, scores = pipeline.recommend(user_id=42, n=5)
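
The effect of the rule boost can be pictured with a few lines of plain Python. This is only a sketch of the ranking behaviour described above, with a hypothetical `apply_rules` helper and made-up scores; it is not rusket's Pipeline code.

```python
RULE_BOOST = 1_000_000.0  # hypothetical constant mirroring the "+1,000,000" above

def apply_rules(cf_scores, user_history, rules):
    """Rank item ids by CF score, forcing triggered rule consequents to the top."""
    scores = dict(cf_scores)
    for antecedent, consequent, rule_score in rules:
        if antecedent in user_history:
            scores[consequent] = RULE_BOOST + rule_score
    return sorted(scores, key=scores.get, reverse=True)

cf_scores = {7: 0.91, 15: 0.84, 999: 0.02}   # made-up CF scores
rules = [(102, 999, 2.0)]                    # headphones (102) -> warranty (999)

print(apply_rules(cf_scores, {102, 7}, rules))  # user bought 102: 999 comes first
print(apply_rules(cf_scores, {7}, rules))       # rule not triggered: CF order kept
```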

15. GenAI / LLM Stack Integration

import lancedb
# Direct export from model
df_vectors = model.export_factors()
db = lancedb.connect("./lancedb")
table = db.create_table("item_embeddings", data=df_vectors, mode="overwrite")

16. Visualizing Latent Spaces (PaCMAP)

# Built-in interactive 2D PaCMAP visualization via fluent API
fig = model.fit().pacmap2().plot(title="Latent Item Space")
fig.show()

17. Dealing with Cold Starts

The "cold start" problem is one of the most common challenges in building recommender systems. It occurs when a system cannot draw accurate inferences because it hasn't yet gathered enough data about a user or an item.

Here is how rusket addresses the three main types of cold starts:

1. Handling User Cold Starts (The "Folding In" Strategy)

If you have an existing ALS model and a new user signs up and clicks on a few items, you don't need to retrain the entire matrix. You can instantly "fold in" their early interactions (e.g. from an onboarding flow) to compute their latent factors on the fly:

import rusket

# Assume model is already fitted on millions of users
model = rusket.ALS(factors=64).fit(X)

# A new user views items [3, 105, 992]
new_user_items = [3, 105, 992]

# Instantly compute their 64-dimensional latent factor vector
user_factors = model.recalculate_user(new_user_items)

# You can now multiply this vector against model.item_factors to score all items
scores = model.item_factors.dot(user_factors)
top_items = scores.argsort()[::-1][:10]
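
For readers curious what a fold-in step computes, the sketch below solves the usual implicit-feedback ALS user update for a single new user while keeping item factors fixed. It is pure Python with only two factors so the 2x2 system can be solved in closed form; the item vectors, alpha and reg values are made up, and it is an assumption that recalculate_user follows this formulation.

```python
# Hypothetical 2-factor item matrix (item id -> latent vector).
item_factors = {
    3:   [1.0, 0.0],
    105: [0.8, 0.2],
    992: [0.9, 0.1],
    500: [0.0, 1.0],
}

def fold_in(seen_items, alpha=40.0, reg=0.01):
    """Solve (Y^T C Y + reg*I) x = Y^T C p for one user, with k = 2 factors."""
    a = b = d = 0.0   # entries of the symmetric matrix [[a, b], [b, d]]
    e = f = 0.0       # right-hand side vector [e, f]
    for item, (y0, y1) in item_factors.items():
        seen = item in seen_items
        c = 1.0 + alpha if seen else 1.0   # confidence weight
        a += c * y0 * y0
        b += c * y0 * y1
        d += c * y1 * y1
        if seen:                           # preference p = 1 only for seen items
            e += c * y0
            f += c * y1
    a += reg
    d += reg
    det = a * d - b * b
    return [(e * d - b * f) / det, (a * f - b * e) / det]

user_vec = fold_in({3, 105, 992})
scores = {
    item: y[0] * user_vec[0] + y[1] * user_vec[1]
    for item, y in item_factors.items()
}
print(sorted(scores, key=scores.get, reverse=True))
```

The three viewed items end up with scores close to 1, while item 500, which points in a different latent direction, scores clearly lower.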

2. Handling System Cold Starts (Knowledge & Context-Aware)

If you want to recommend items to a user based purely on their demographics or context before they even make a single click, you should use Factorization Machines (rusket.FM).

FM allows you to use a sparse feature matrix (one-hot encoded categories, ages, locations, time of day) instead of just user and item IDs. By learning the pairwise interactions between these features, FM can recommend items based entirely on metadata.

import rusket
from scipy.sparse import csr_matrix
import numpy as np

# Format: [User=Alice, Item=Laptop, Age=25-34, Category=Electronics]
# 1 represents the presence of that categorical feature
X = csr_matrix([
    [1, 1, 1, 1], # Alice bought Laptop
    [0, 1, 0, 1]  # Bob did not buy Laptop
], dtype=np.float32)

y = np.array([1.0, 0.0])

model = rusket.FM(factors=8, iterations=100)
model.fit(X, y)

# Predict CTR for a new user based on demographics
# X_new = [User=Charlie, Item=Laptop, Age=25-34, Category=Electronics]
X_new = csr_matrix([[0, 1, 1, 1]], dtype=np.float32)

ctr_prob = model.predict_proba(X_new)
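
The pairwise interactions mentioned above follow the standard second-order FM equation. The sketch below evaluates it in plain Python for the new-user row, using the usual linear-time rewrite of the pairwise term. All parameter values are invented for illustration, and the logistic link used to turn the score into a probability is an assumption about predict_proba.

```python
import math

def fm_score(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i*x_i + sum_{i<j} <V[i], V[j]>*x_i*x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        # Identity: sum_{i<j} a_i*a_j = 0.5 * ((sum a_i)^2 - sum a_i^2)
        terms = [V[i][f] * x[i] for i in range(len(x))]
        pairwise += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))
    return w0 + linear + pairwise

# Hypothetical learned parameters for the 4 features [User, Item, Age, Category].
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.05]
V = [[0.1, 0.2], [0.3, -0.1], [0.2, 0.4], [-0.2, 0.1]]

x_new = [0, 1, 1, 1]                    # same feature row as X_new above
score = fm_score(x_new, w0, w, V)
prob = 1.0 / (1.0 + math.exp(-score))   # logistic link (assumed)

print(round(score, 4), round(prob, 4))
```

Because the user column is 0, the score is driven entirely by the item, age and category features and their interactions, which is what makes the prediction possible without any click history.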

Alternatively, rusket's Association Rule mining (FPGrowth, Eclat) can act as a knowledge-based fallback. Recommender.recommend_for_cart() uses explicit IF (A) THEN (B) rules to suggest items without relying on converged user factors.

3. Handling Item Cold Starts (Content-Based Hybrid)

When a new product is added to the catalog and no one has interacted with it yet, rusket.Recommender can fall back to semantic similarity.

By providing an item_embeddings matrix (e.g., dense vectors computed from product descriptions with an OpenAI embedding model), the Recommender blends behavioral CF scores with semantic similarity:

from rusket import Recommender

# Set alpha=0.0 for pure Content-Based semantic recommender, or alpha=0.5 for Hybrid
rec = Recommender(
    model=als_model, 
    item_embeddings=llm_text_embeddings
)

items, scores = rec.recommend_for_user(user_id=123, alpha=0.5)
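
A simple way to picture the role of alpha is a linear blend of the two score sources. The sketch below assumes exactly that (alpha weights the CF score, 1 - alpha weights the content score) on made-up, already-normalised scores; the actual blending and normalisation inside Recommender may differ.

```python
def blend(cf_scores, content_scores, alpha):
    """alpha=1.0 gives pure CF, alpha=0.0 gives pure content-based scoring."""
    items = set(cf_scores) | set(content_scores)
    return {
        item: alpha * cf_scores.get(item, 0.0)
        + (1.0 - alpha) * content_scores.get(item, 0.0)
        for item in items
    }

# Hypothetical scores; "N99" is a brand-new item with no CF signal yet.
cf_scores = {"A10": 0.9, "B22": 0.4}
content_scores = {"A10": 0.2, "B22": 0.5, "N99": 0.8}

hybrid = blend(cf_scores, content_scores, alpha=0.5)
ranked = sorted(hybrid, key=hybrid.get, reverse=True)
print(ranked)

content_only = blend(cf_scores, content_scores, alpha=0.0)
print(max(content_only, key=content_only.get))
```

With alpha=0.5 the new item is still rankable despite having no interactions, and with alpha=0.0 it comes out on top because only the semantic score counts.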

18. Hyperparameter Tuning with Optuna + MLflow

rusket ships built-in Bayesian hyperparameter optimisation via Optuna. Use optuna_optimize to search the hyperparameter space adaptively; this typically needs far fewer trials than an exhaustive grid search.

Basic usage

import rusket

result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    metric="ndcg",
    k=10,
)

print(f"Best score: {result.best_score:.4f}")
print(f"Best params: {result.best_params}")
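
For readers unfamiliar with the metric being optimised, the sketch below computes binary-relevance NDCG@k for a single user in plain Python. It follows the textbook definition; how rusket builds its evaluation split and averages over users is not shown here and may differ.

```python
import math

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical held-out items and two candidate rankings.
held_out = {"a", "b"}
perfect = ndcg_at_k(["a", "b", "c"], held_out, k=3)
late_hit = ndcg_at_k(["x", "a", "y"], {"a"}, k=3)

print(perfect, round(late_hit, 4))
```

A hit at rank 2 instead of rank 1 is discounted by 1/log2(3), which is why the second ranking scores about 0.63.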

Custom search space

from rusket import OptunaSearchSpace, optuna_optimize, eALS

result = optuna_optimize(
    eALS,
    df,
    user_col="user_id",
    item_col="item_id",
    search_space=[
        OptunaSearchSpace.int("factors", 16, 256, log=True),
        OptunaSearchSpace.float("alpha", 1.0, 100.0, log=True),
        OptunaSearchSpace.float("regularization", 1e-4, 1.0, log=True),
        OptunaSearchSpace.int("iterations", 5, 30),
    ],
    n_trials=100,
    n_folds=3,
    metric="precision",
    refit_best=True,
)

# result.best_model is already fitted on the full dataset
items, scores = result.best_model.recommend_items(user_id=42, n=10)

MLflow experiment tracking

Log every trial's parameters and metrics to MLflow automatically:

pip install mlflow optuna-integration

import mlflow
import rusket

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("als-tuning")

result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    metric="ndcg",
    mlflow_tracking=True,  # ← logs every trial to MLflow
)

Custom callbacks

Pass any Optuna-compatible callbacks; they are forwarded to study.optimize():

result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    callbacks=[my_custom_callback],
)
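
The snippet above leaves my_custom_callback undefined. As one possible definition, the sketch below builds a callback that stops the study once a trial reaches a target score. Optuna invokes callbacks as callback(study, trial) after each trial, and study.stop() is part of Optuna's API; the factory name, the 0.35 threshold and the assumption that the metric is maximised are illustrative.

```python
def make_target_callback(target_score):
    """Build a callback that stops the study once a trial reaches `target_score`."""
    def callback(study, trial):
        # Optuna passes the study and the just-finished (frozen) trial.
        # trial.value is None for failed or pruned trials.
        if trial.value is not None and trial.value >= target_score:
            print(f"Trial {trial.number} reached {trial.value:.4f}; stopping.")
            study.stop()
    return callback

# Hypothetical threshold; assumes the optimised metric is maximised.
my_custom_callback = make_target_callback(target_score=0.35)
```

Stopping early this way saves trials when a "good enough" score is all that is needed, at the cost of possibly missing a better configuration later in the search.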