API Reference¶
This file is auto-generated by
scripts/gen_api_reference.py. Do not edit by hand — update the Python docstrings instead.
Functional API¶
Convenience module-level functions. For most use cases these are the only entry points you need.
mine¶
Mine frequent itemsets using the specified algorithm.
This module-level function relies on the Object-Oriented APIs.
from rusket.miners.mine import mine
mine(df: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, method: 'str' = 'fpgrowth', verbose: 'int' = 0, column_names: 'list[str] | None' = None) -> 'pd.DataFrame'
fpgrowth¶
Find frequent itemsets using FP-growth, or Eclat when method="eclat" is passed.
This module-level function relies on the Object-Oriented APIs.
from rusket.miners.fpgrowth import fpgrowth
fpgrowth(df: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, method: 'str' = 'fpgrowth', verbose: 'int' = 0, column_names: 'list[str] | None' = None) -> 'pd.DataFrame'
eclat¶
Find frequent itemsets using the Eclat algorithm.
This module-level function relies on the Object-Oriented APIs.
from rusket.miners.eclat import eclat
eclat(df: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, verbose: 'int' = 0, column_names: 'list[str] | None' = None) -> 'pd.DataFrame'
association_rules¶
Generate association rules from a DataFrame of frequent itemsets.
from rusket.miners.association_rules import association_rules
association_rules(df: 'pd.DataFrame | Any', num_itemsets: 'int | None' = None, df_orig: 'pd.DataFrame | None' = None, null_values: 'bool' = False, metric: 'str' = 'confidence', min_threshold: 'float' = 0.8, support_only: 'bool' = False, return_metrics: 'list[str]' = ['antecedent support', 'consequent support', 'support', 'confidence', 'lift', 'representativity', 'leverage', 'conviction', 'zhangs_metric', 'jaccard', 'certainty', 'kulczynski']) -> 'pd.DataFrame'
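For intuition, several of the metrics listed in return_metrics are simple functions of three support values. A pure-Python sketch of the standard definitions (the helper name is illustrative, not part of the rusket API):

```python
# Illustrative computation of confidence, lift, and leverage for a rule
# A -> C from raw support values. Conceptual sketch only, not the
# library's implementation.

def rule_metrics(support_a, support_c, support_ac):
    """Metrics for a rule A -> C given supp(A), supp(C), supp(A ∪ C)."""
    confidence = support_ac / support_a          # P(C | A)
    lift = confidence / support_c                # ratio to independence
    leverage = support_ac - support_a * support_c  # difference to independence
    return confidence, lift, leverage

# Rule {bread} -> {milk}: bread in 60% of baskets, milk in 50%,
# both together in 40%.
confidence, lift, leverage = rule_metrics(0.6, 0.5, 0.4)
```

A lift above 1 (here 4/3) means the antecedent and consequent co-occur more often than independence would predict.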
prefixspan¶
Mine sequential patterns using the PrefixSpan algorithm.
This function discovers frequent sequences of items across multiple users/sessions. Currently, this assumes sequences where each event consists of a single item (e.g., a sequence of page views or a sequence of individual products bought over time).
from rusket.miners.prefixspan import prefixspan
prefixspan(sequences: 'list[list[int]]', min_support: 'int | float', max_len: 'int | None' = None) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| sequences | list of list of int | A list of sequences, where each sequence is a list of integers representing items. Example: [[1, 2, 3], [1, 3], [2, 3]]. |
| min_support | int or float | The minimum support: an absolute count of sequences the pattern must appear in (int), or a relative fraction of sequences (float). |
| max_len | int, optional | The maximum length of the sequential patterns to mine. |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | A DataFrame containing 'support' and 'sequence' columns. |
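What "support" means here can be sketched in a few lines of pure Python: a sequence supports a pattern when the pattern's items appear in order, not necessarily contiguously. This is a conceptual illustration only, not the library's Rust implementation:

```python
# Support counting for a sequential pattern: a pattern is supported by a
# sequence if its items appear in order (gaps allowed).

def is_subsequence(pattern, sequence):
    it = iter(sequence)                      # consume left to right
    return all(item in it for item in pattern)

def sequence_support(pattern, sequences):
    return sum(is_subsequence(pattern, s) for s in sequences)

sequences = [[1, 2, 3], [1, 3], [2, 3]]
support = sequence_support([1, 3], sequences)  # [1, 3] appears in 2 of 3
```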
hupm¶
Mine high-utility itemsets.
This function discovers combinations of items that generate a high total utility (e.g., profit) across all transactions, even if they aren't the most frequent.
from rusket.miners.hupm import hupm
hupm(transactions: 'list[list[int]]', utilities: 'list[list[float]]', min_utility: 'float', max_len: 'int | None' = None) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| transactions | list of list of int | A list of transactions, where each transaction is a list of item IDs. |
| utilities | list of list of float | A list of identical structure to transactions, but containing the numeric utility (e.g., profit) of that item in that specific transaction. |
| min_utility | float | The minimum total utility required to consider a pattern "high-utility". |
| max_len | int, optional | The maximum length of the itemsets to mine. |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | A DataFrame containing 'utility' and 'itemset' columns. |
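The utility of an itemset is the sum, over all transactions containing every item of the set, of those items' per-transaction utilities. A brute-force pure-Python sketch of that definition (for intuition only; the library mines high-utility sets in Rust without enumerating candidates like this):

```python
# Total utility of an itemset across all transactions that contain it.

def itemset_utility(itemset, transactions, utilities):
    total = 0.0
    for items, utils in zip(transactions, utilities):
        if all(i in items for i in itemset):
            total += sum(u for i, u in zip(items, utils) if i in itemset)
    return total

transactions = [[1, 2, 3], [1, 2], [2, 3]]
utilities = [[5.0, 2.0, 1.0], [5.0, 2.0], [2.0, 1.0]]
u = itemset_utility({1, 2}, transactions, utilities)  # 7.0 + 7.0 = 14.0
```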
sequences_from_event_log¶
Convert an event log DataFrame into the sequence format required by PrefixSpan.
Accepts Pandas, Polars, or PySpark DataFrames. Data is grouped by user_col,
ordered by time_col, and item_col values are collected into sequences.
from rusket.miners.prefixspan import sequences_from_event_log
sequences_from_event_log(df: 'Any', user_col: 'str', time_col: 'str', item_col: 'str') -> 'tuple[list[list[int]], dict[int, Any]]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | pl.DataFrame | pyspark.sql.DataFrame | Event log containing users, timestamps, and items. |
| user_col | str | Column name identifying the sequence (e.g., user_id or session_id). |
| time_col | str | Column name for ordering events. |
| item_col | str | Column name for the items. |
Returns
| Type | Description |
|---|---|
| tuple[list[list[int]], dict[int, Any]] | A tuple (sequences, item_mapping): the ordered item-ID sequences, one per user, and a dictionary mapping the integer IDs back to the original item labels. |
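The core of the conversion can be sketched in pure Python: group rows by user, sort each group by timestamp, and collect the items in order. The real function additionally maps arbitrary item labels to integer IDs; the helper below is illustrative only:

```python
# Group an event log into per-user, time-ordered item sequences.

def to_sequences(rows):
    """rows: iterable of (user, time, item) tuples."""
    groups = {}
    for user, time, item in rows:
        groups.setdefault(user, []).append((time, item))
    return [
        [item for _, item in sorted(events)]   # order each user's events by time
        for _, events in sorted(groups.items())
    ]

rows = [("u1", 2, "b"), ("u1", 1, "a"), ("u2", 1, "c")]
seqs = to_sequences(rows)  # [["a", "b"], ["c"]]
```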
mine_hupm¶
Mine high-utility itemsets from a long-format DataFrame.
Converts a Pandas or Polars DataFrame into the required list-of-lists format and runs the High-Utility Pattern Mining (HUPM) algorithm.
from rusket.miners.hupm import mine_hupm
mine_hupm(data: 'Any', transaction_col: 'str', item_col: 'str', utility_col: 'str', min_utility: 'float', max_len: 'int | None' = None) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| data | pd.DataFrame or pl.DataFrame | A long-format DataFrame where each row represents an item in a transaction. |
| transaction_col | str | Column name identifying the transaction ID. |
| item_col | str | Column name identifying the item ID (must be numeric integers). |
| utility_col | str | Column name identifying the numeric utility (e.g. price, profit) of the item. |
| min_utility | float | The minimum total utility required to consider a pattern "high-utility". |
| max_len | int, optional | Maximum length of the itemsets to mine. |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | A DataFrame containing 'utility' and 'itemset' columns. |
mine_duckdb¶
Stream directly from a DuckDB query via Arrow RecordBatches.
This is extremely memory efficient, bypassing Pandas entirely.
from rusket.miners.streaming import mine_duckdb
mine_duckdb(con: 'Any', query: 'str', n_items: 'int', txn_col: 'str', item_col: 'str', min_support: 'float' = 0.5, max_len: 'int | None' = None, chunk_size: 'int' = 1000000) -> 'pd.DataFrame'
mine_spark¶
Stream natively from a PySpark DataFrame on Databricks via Arrow.
Uses toLocalIterator() to fetch Arrow chunks incrementally on the
driver node, avoiding massive memory spikes.
from rusket.miners.streaming import mine_spark
mine_spark(spark_df: 'Any', n_items: 'int', txn_col: 'str', item_col: 'str', min_support: 'float' = 0.5, max_len: 'int | None' = None) -> 'pd.DataFrame'
from_transactions¶
Convert long-format transactional data to a one-hot boolean matrix.
The return type mirrors the input type:
- Polars DataFrame → Polars DataFrame
- Pandas DataFrame → Pandas DataFrame
- Spark DataFrame → Spark DataFrame
- list[list[...]] → Pandas DataFrame
from rusket.miners.transactions import from_transactions
from_transactions(data: 'DataFrame | Sequence[Sequence[str | int]] | Any', transaction_col: 'str | None' = None, item_col: 'str | None' = None, min_item_count: 'int' = 1, verbose: 'int' = 0) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| data | DataFrame or list of lists | A Pandas / Polars / Spark DataFrame with at least two columns (one for the transaction identifier and one for the item), or a list of lists where each inner list contains the items of a single transaction, e.g. [["bread", "milk"], ["bread", "eggs"]]. |
| transaction_col | str, optional | Name of the column that identifies transactions. If None the first column is used. Ignored for list-of-lists input. |
| item_col | str, optional | Name of the column that contains item values. If None the second column is used. Ignored for list-of-lists input. |
| min_item_count | int, default=1 | Minimum number of times an item must appear to be included in the resulting one-hot-encoded matrix. |
Returns
| Type | Description |
|---|---|
| DataFrame | A boolean DataFrame (same type as input) ready for :func:rusket.fpgrowth or :func:rusket.eclat. Column names correspond to the unique items. |
Examples
>>> import rusket
>>> import pandas as pd
>>> df = pd.DataFrame({
... "order_id": [1, 1, 1, 2, 2, 3],
... "item": [3, 4, 5, 3, 5, 8],
... })
>>> ohe = rusket.from_transactions(df)
>>> freq = rusket.fpgrowth(ohe, min_support=0.5, use_colnames=True)
from_transactions_csr¶
Convert long-format transactional data to a CSR matrix + column names.
Unlike :func:from_transactions, this returns a raw
scipy.sparse.csr_matrix that can be passed directly to
:func:rusket.fpgrowth or :func:rusket.eclat — no pandas overhead.
For billion-row datasets, this processes data in chunks of chunk_size
rows, keeping peak memory to one chunk + the running CSR.
from rusket.miners.transactions import from_transactions_csr
from_transactions_csr(data: 'DataFrame | str | Any', transaction_col: 'str | None' = None, item_col: 'str | None' = None, chunk_size: 'int' = 10000000) -> 'tuple[Any, list[str]]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| data | DataFrame or str | A Pandas DataFrame with at least two columns; a Polars or Spark DataFrame (converted internally); or a file path (str / Path) to a Parquet file, read in chunks. |
| transaction_col | str, optional | Name of the transaction-id column. Defaults to the first column. |
| item_col | str, optional | Name of the item column. Defaults to the second column. |
| chunk_size | int, default=10_000_000 | Number of rows per chunk. Lower values use less memory. |
Returns
| Type | Description |
|---|---|
| tuple[scipy.sparse.csr_matrix, list[str]] | A CSR matrix and the list of column (item) names. Pass both straight to the miners: csr, names = from_transactions_csr(df); freq = fpgrowth(csr, min_support=0.001, use_colnames=True, column_names=names). |
Examples
>>> import rusket
>>> csr, names = rusket.from_transactions_csr("orders.parquet")
>>> freq = rusket.fpgrowth(csr, min_support=0.001,
... use_colnames=True, column_names=names)
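The CSR layout itself is easy to picture: indptr[t]..indptr[t+1] delimits the item indices of transaction t inside the flat indices array. A pure-Python sketch of that structure (the library builds it in Rust; this is for intuition only):

```python
# Build CSR-style (indptr, indices) arrays from a list of transactions.

def build_csr(transactions):
    indptr, indices = [0], []
    for items in transactions:
        indices.extend(items)          # flat item-index list
        indptr.append(len(indices))    # where the next transaction starts
    return indptr, indices

indptr, indices = build_csr([[0, 2], [1], [0, 1, 2]])
# indptr == [0, 2, 3, 6]; transaction 2 is indices[indptr[2]:indptr[3]]
```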
from_pandas¶
Shorthand for from_transactions(df, transaction_col, item_col).
from rusket.miners.transactions import from_pandas
from_pandas(df: 'pd.DataFrame', transaction_col: 'str | None' = None, item_col: 'str | None' = None, min_item_count: 'int' = 1, verbose: 'int' = 0) -> 'pd.DataFrame'
from_polars¶
Shorthand for from_transactions(df, transaction_col, item_col).
from rusket.miners.transactions import from_polars
from_polars(df: 'pl.DataFrame', transaction_col: 'str | None' = None, item_col: 'str | None' = None, min_item_count: 'int' = 1, verbose: 'int' = 0) -> 'pl.DataFrame'
from_spark¶
Shorthand for from_transactions(df, transaction_col, item_col).
from rusket.miners.transactions import from_spark
from_spark(df: 'SparkDataFrame', transaction_col: 'str | None' = None, item_col: 'str | None' = None, min_item_count: 'int' = 1, verbose: 'int' = 0) -> 'SparkDataFrame'
from_arrow¶
Convert a PyArrow Table in long format to a one-hot boolean PyArrow Table.
This is a zero-copy-friendly shorthand for from_transactions(table, ...). The
input table must have at least two columns: one for the transaction identifier and
one for the item. The returned table has boolean columns (one per unique item).
from rusket.miners.transactions import from_arrow
from_arrow(table: 'pa.Table', transaction_col: 'str | None' = None, item_col: 'str | None' = None, min_item_count: 'int' = 1, verbose: 'int' = 0) -> 'pa.Table'
Parameters
| Parameter | Type | Description |
|---|---|---|
| table | pa.Table | A pyarrow.Table with at least two columns (transaction id + item). |
| transaction_col | str, optional | Name of the transaction-id column. Defaults to the first column. |
| item_col | str, optional | Name of the item column. Defaults to the second column. |
| min_item_count | int, default=1 | Minimum occurrences for an item to be included. |
| verbose | int | Verbosity level. |
Returns
| Type | Description |
|---|---|
| pyarrow.Table | A boolean Table ready for :func:rusket.fpgrowth / :func:rusket.eclat. |
evaluate¶
Evaluate a trained recommendation model on a test set.
Compute metrics like NDCG@k, Hit Rate@k, Precision@k, and Recall@k using fast native Rust evaluation loops.
When a model has _user_labels / _item_labels (set by
from_transactions()), the test IDs are automatically mapped to
internal 0-based indices so that recommend_items() receives valid
indices and the recommended item indices can be compared with the
ground truth.
from rusket.evaluation.metrics import evaluate
evaluate(model: 'Any', test_interactions: 'Any', k: 'int' = 10, metrics: 'list[MetricName] | None' = None) -> 'dict[str, float]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | Any | A trained recommendation model supporting recommend_items(user_id, k, exclude_seen). |
| test_interactions | np.ndarray or pd.DataFrame | Ground truth test interactions. Must either have columns "user" and "item", or be a 2D array. |
| k | int, default=10 | The cutoff rank for evaluation. |
| metrics | list of str, optional | Metrics to compute. Default: ["ndcg", "hr", "precision", "recall"]. |
Returns
| Type | Description |
|---|---|
| dict[str, float] | Dictionary of averaged metric values. |
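For intuition, here is what the four default metrics compute for a single user, using their standard definitions (the library averages these over all users in Rust; the helper below is illustrative, not part of the rusket API):

```python
# Ranking metrics at cutoff k for one user.
# `recommended` is the model's ranked list; `relevant` the held-out items.
import math

def metrics_at_k(recommended, relevant, k):
    topk = recommended[:k]
    hits = [item in relevant for item in topk]
    # DCG discounts hits by log2 of their (1-based) rank + 1.
    dcg = sum(1 / math.log2(rank + 2) for rank, hit in enumerate(hits) if hit)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return {
        "hr": float(any(hits)),              # was anything relevant retrieved?
        "precision": sum(hits) / k,
        "recall": sum(hits) / len(relevant),
        "ndcg": dcg / idcg if idcg else 0.0,
    }

m = metrics_at_k([7, 3, 9], relevant={3}, k=3)  # one hit, at rank 2
```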
train_test_split¶
Split interactions into random train and test sets.
from rusket.evaluation.splitting import train_test_split
train_test_split(df: 'pd.DataFrame', user_col: 'str', item_col: 'str', test_size: 'float' = 0.2, random_state: 'int | None' = None) -> 'tuple[pd.DataFrame, pd.DataFrame]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The interaction dataframe. |
| user_col | str | Name of the user column. |
| item_col | str | Name of the item column. |
| test_size | float, default=0.2 | Fraction of the data to put in the test set. |
| random_state | int, optional | Random seed (currently unused by the Rust backend; reserved for future use). |
Returns
| Type | Description |
|---|---|
| tuple[pd.DataFrame, pd.DataFrame] | train_df, test_df |
leave_one_out_split¶
Leave exactly one interaction per user for the test set.
If a timestamp column is provided, the latest interaction is left out. If no timestamp is provided, a random interaction is chosen.
from rusket.evaluation.splitting import leave_one_out_split
leave_one_out_split(df: 'pd.DataFrame', user_col: 'str', item_col: 'str', timestamp_col: 'str | None' = None) -> 'tuple[pd.DataFrame, pd.DataFrame]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The interaction dataframe. |
| user_col | str | Name of the user column (should be numeric, ideally encodable to i32, e.g. a pandas integer dtype). |
| item_col | str | Name of the item column. |
| timestamp_col | str, optional | Name of the timestamp or ordering column. |
Returns
| Type | Description |
|---|---|
| tuple[pd.DataFrame, pd.DataFrame] | train_df, test_df |
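The splitting rule for the timestamp case can be sketched in pure Python: for each user, the interaction with the greatest timestamp goes to the test set and the rest stay in train (the real function handles DataFrames and the random no-timestamp case; this helper is illustrative only):

```python
# Leave-one-out split: the latest interaction per user becomes test data.

def leave_one_out(rows):
    """rows: list of (user, item, timestamp) tuples."""
    latest = {}
    for user, item, ts in rows:
        if user not in latest or ts > latest[user][2]:
            latest[user] = (user, item, ts)
    test = set(latest.values())
    train = [r for r in rows if r not in test]
    return train, sorted(test)

rows = [(1, "a", 10), (1, "b", 20), (2, "c", 5)]
train, test = leave_one_out(rows)
# train == [(1, "a", 10)], test == [(1, "b", 20), (2, "c", 5)]
```

Note that a user with a single interaction ends up with no training rows, which is inherent to leave-one-out splitting.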
pca¶
Project data into n_components dimensions using PCA.
from rusket.viz.pca import pca
pca(x: 'npt.NDArray[Any]', n_components: 'int' = 2, svd_solver: 'str' = 'auto') -> 'ProjectedSpace'
Parameters
| Parameter | Type | Description |
|---|---|---|
| x | array-like of shape (n_samples, n_features) | The data to project. |
| n_components | int, default=2 | Number of dimensions to keep. |
| svd_solver | {"auto", "exact", "randomized"}, default="auto" | SVD solver to use. |
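What "project onto principal components" means can be shown in miniature: center the data, form the covariance matrix, and project onto its leading eigenvector, found here by power iteration. A 2-D pure-Python sketch for intuition only; the library uses proper SVD solvers:

```python
# Leading principal direction of 2-D points via power iteration.

def first_component(points, iters=100):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # 2x2 covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points on the line y = x: the first component is the (1, 1) direction.
v = first_component([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
```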
pca2¶
Project data into exactly 2 dimensions using PCA.
from rusket.viz.pca import pca2
pca2(x: 'npt.NDArray[Any]', svd_solver: 'str' = 'auto') -> 'ProjectedSpace'
pca3¶
Project data into exactly 3 dimensions using PCA.
from rusket.viz.pca import pca3
pca3(x: 'npt.NDArray[Any]', svd_solver: 'str' = 'auto') -> 'ProjectedSpace'
OOP Mining API¶
All mining classes share a common Miner.from_transactions() / .mine() interface. FPGrowth, Eclat, FIN, LCM, and HUPM also inherit RuleMinerMixin which adds .association_rules() and .recommend_items() helpers.
FPGrowth¶
FP-Growth frequent itemset miner.
This class wraps the fast, core Rust FP-Growth implementation.
from rusket.miners.fpgrowth import FPGrowth
FPGrowth(data: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', item_names: 'list[str] | None' = None, min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, verbose: 'int' = 0, **kwargs: 'Any')
FPGrowth.mine¶
Execute the FP-growth algorithm on the stored data.
Returns
| Type | Description |
|---|---|
| pandas.DataFrame | DataFrame with two columns: - support: the support score. - itemsets: list of items (indices or column names). |
Eclat¶
Eclat frequent itemset miner.
Eclat is typically faster than FP-growth on dense datasets due to efficient vertical bitset intersection logic.
from rusket.miners.eclat import Eclat
Eclat(data: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', item_names: 'list[str] | None' = None, min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, verbose: 'int' = 0, **kwargs: 'Any')
Eclat.mine¶
Execute the Eclat algorithm on the stored data.
Returns
| Type | Description |
|---|---|
| pandas.DataFrame | DataFrame with two columns: - support: the support score. - itemsets: list of items (indices or column names). |
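Eclat's vertical representation is easy to picture with plain Python sets: each item maps to the set of transaction IDs containing it (its "tidset"), and the support of an itemset is the size of the intersection of its members' tidsets. The library does this with packed bitsets in Rust; sets are shown here for clarity only:

```python
# Vertical (tidset) view of a 4-transaction dataset.
tidsets = {
    "bread": {0, 1, 3},
    "milk": {0, 1, 2},
    "eggs": {2, 3},
}
n_transactions = 4

# Support of {bread, milk} = |tid(bread) ∩ tid(milk)| / n_transactions
support = len(tidsets["bread"] & tidsets["milk"]) / n_transactions  # 0.5
```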
PrefixSpan¶
Sequential Pattern Mining (PrefixSpan) model.
This class discovers frequent sequences of items across multiple users/sessions.
from rusket.miners.prefixspan import PrefixSpan
PrefixSpan(data: 'list[list[int]]', min_support: 'int | float', max_len: 'int | None' = None, item_mapping: 'dict[int, Any] | None' = None)
PrefixSpan.mine¶
Mine sequential patterns using PrefixSpan.
PrefixSpan.mine(**kwargs: 'Any') -> 'pd.DataFrame'
Returns
| Type | Description |
|---|---|
| pd.DataFrame | A DataFrame containing 'support' and 'sequence' columns. Sequences are mapped back to original item names if from_transactions was used. |
HUPM¶
High-Utility Pattern Mining (HUPM) model.
This class discovers combinations of items that generate a high total utility (e.g., profit) across all transactions, even if they aren't the most frequent.
from rusket.miners.hupm import HUPM
HUPM(transactions: 'list[list[int]]', utilities: 'list[list[float]]', min_utility: 'float', max_len: 'int | None' = None)
HUPM.mine¶
Mine high-utility itemsets.
Returns
| Type | Description |
|---|---|
| pd.DataFrame | A DataFrame containing 'utility' and 'itemset' columns. |
FPMiner¶
Streaming FP-Growth / Eclat accumulator for billion-row datasets.
Feeds (transaction_id, item_id) integer arrays to Rust one chunk at a
time. Rust accumulates per-transaction item lists in a
HashMap<i64, Vec<i32>>. Peak Python memory = one chunk.
from rusket.miners.streaming import FPMiner
FPMiner(n_items: 'int', max_ram_mb: 'int | None' = -1, hint_n_transactions: 'int | None' = None) -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| n_items | int | Number of distinct items (column count). All item IDs fed via :meth:add_chunk must be in [0, n_items). |
Examples
Process a Parquet file 10 M rows at a time:
>>> import pyarrow.parquet as pq
>>> from rusket import FPMiner
>>> miner = FPMiner(n_items=500_000)
>>> for batch in pq.ParquetFile("orders.parquet").iter_batches(batch_size=10_000_000):
...     chunk = batch.to_pandas()
...     txn = chunk["txn_id"].to_numpy(dtype="int64")
...     item = chunk["item_idx"].to_numpy(dtype="int32")
...     miner.add_chunk(txn, item)
>>> freq = miner.mine(min_support=0.001, max_len=3, use_colnames=True)
FPMiner.add_arrow_batch¶
Feed a PyArrow RecordBatch directly into the miner. Zero-copy extraction is used if types match (Int64/Int32).
FPMiner.add_arrow_batch(batch: 'Any', txn_col: 'str', item_col: 'str') -> 'FPMiner'
FPMiner.add_chunk¶
Feed a chunk of (transaction_id, item_id) pairs.
FPMiner.add_chunk(txn_ids: 'np.ndarray', item_ids: 'np.ndarray') -> 'FPMiner'
Parameters
| Parameter | Type | Description |
|---|---|---|
| txn_ids | np.ndarray[int64] | 1-D array of transaction identifiers (arbitrary 64-bit integers). |
| item_ids | np.ndarray[int32] | 1-D array of item column indices (0-based). |
Returns
| Type | Description |
|---|---|
| FPMiner | The miner itself, enabling method chaining. |
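The accumulator semantics can be mirrored in a few lines of pure Python: each chunk appends item IDs to a per-transaction list keyed by transaction ID, just like the Rust HashMap<i64, Vec<i32>> described above. Illustrative only; the class name below is hypothetical:

```python
# Minimal per-transaction accumulator mirroring FPMiner's chunk feeding.

class TinyAccumulator:
    def __init__(self):
        self.txns = {}

    def add_chunk(self, txn_ids, item_ids):
        for t, i in zip(txn_ids, item_ids):
            self.txns.setdefault(t, []).append(i)
        return self  # returns self for chaining, like FPMiner

acc = TinyAccumulator()
acc.add_chunk([10, 10, 11], [0, 2, 1]).add_chunk([11, 12], [3, 0])
# acc.txns == {10: [0, 2], 11: [1, 3], 12: [0]}
```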
FPMiner.fit¶
Sklearn-compatible alias for mine(). Runs the mining algorithm.
Returns
| Type | Description |
|---|---|
| FPMiner | The fitted miner (self). |
FPMiner.mine¶
Mine frequent itemsets from all accumulated transactions.
FPMiner.mine(min_support: 'float' = 0.5, max_len: 'int | None' = None, use_colnames: 'bool' = True, column_names: 'list[str] | None' = None, method: "typing.Literal['fpgrowth', 'eclat']" = 'fpgrowth', verbose: 'int' = 0) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| min_support | float | Minimum support threshold in (0, 1]. |
| max_len | int | None | Maximum itemset length. |
| use_colnames | bool | If True, itemsets contain column names instead of indices. |
| column_names | list[str] | None | Column names to use when use_colnames=True. |
| method | "fpgrowth" | "eclat" | Mining algorithm to use. |
| verbose | int | Level of verbosity: >0 prints progress logs and times. |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | Columns support and itemsets. |
FPMiner.predict¶
Return the last mined result, running fit() first if none exists.
FPMiner.predict(**kwargs: 'Any') -> 'pd.DataFrame'
Returns
| Type | Description |
|---|---|
| pd.DataFrame | The frequent itemsets. |
FPMiner.reset¶
Free all accumulated data.
FIN¶
FIN frequent itemset miner, based on the Nodeset data structure.
This class wraps the fast core Rust FIN implementation.
from rusket.miners.fin import FIN
FIN(data: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', item_names: 'list[str] | None' = None, min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, verbose: 'int' = 0, **kwargs: 'Any')
FIN.mine¶
Execute the FIN algorithm on the stored data.
Returns
| Type | Description |
|---|---|
| pandas.DataFrame | DataFrame with two columns: - support: the support score. - itemsets: list of items (indices or column names). |
LCM¶
LCM (Linear time Closed itemset Miner).
This class wraps the fast core Rust LCM implementation using Prefix-Preserving Closure Extension. It produces only closed frequent itemsets, offering massive memory savings and faster execution out-of-the-box compared to classic algorithms on dense datasets.
from rusket.miners.lcm import LCM
LCM(data: 'pd.DataFrame | pl.DataFrame | np.ndarray | Any', item_names: 'list[str] | None' = None, min_support: 'float' = 0.5, null_values: 'bool' = False, use_colnames: 'bool' = True, max_len: 'int | None' = None, verbose: 'int' = 0, **kwargs: 'Any')
LCM.mine¶
Execute the LCM algorithm on the stored data to find closed itemsets.
Returns
| Type | Description |
|---|---|
| pandas.DataFrame | DataFrame with two columns: - support: the support score. - itemsets: list of items (indices or column names). |
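What "closed" means is worth a tiny illustration: an itemset is closed if no strict superset has the same support. A brute-force check for intuition only; LCM avoids this enumeration entirely via prefix-preserving closure extension:

```python
# Brute-force closedness check over a toy transaction database.

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def is_closed(itemset, transactions, universe):
    s = support(itemset, transactions)
    return all(
        support(itemset | {extra}, transactions) < s
        for extra in universe - itemset
    )

txns = [{1, 2, 3}, {1, 2}, {2, 3}]
# Item 1 always co-occurs with 2, so {1} is not closed but {1, 2} is.
```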
RuleMinerMixin — Shared Miner Interface¶
FPGrowth, Eclat, FIN, LCM, and HUPM all inherit these methods from RuleMinerMixin. You do not construct RuleMinerMixin directly.
RuleMinerMixin.association_rules¶
Generate association rules from the mined frequent itemsets.
from rusket.model._mixins import RuleMinerMixin
RuleMinerMixin.association_rules(metric: 'str' = 'confidence', min_threshold: 'float' = 0.8, return_metrics: 'list[str] | None' = None) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| metric | str, default='confidence' | The metric to evaluate if a rule is of interest. |
| min_threshold | float, default=0.8 | The minimum threshold for the evaluation metric. |
| return_metrics | list[str] | None, default=None | List of metrics to include in the resulting DataFrame. Defaults to all available metrics. |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | DataFrame of strong association rules. |
RuleMinerMixin.recommend_items¶
Deprecated: use :meth:recommend_for_cart instead.
RuleMinerMixin.recommend_items(items: 'list[Any]', n: 'int' = 5) -> 'list[Any]'
RuleMinerMixin._invalidate_rules_cache¶
Clear the cached association rules (call after re-mining).
RuleMinerMixin._invalidate_rules_cache() -> 'None'
Recommenders¶
ALS¶
Implicit ALS collaborative filtering model.
from rusket.recommenders.als import ALS
ALS(factors: 'int' = 64, regularization: 'float' = 0.01, alpha: 'float' = 40.0, iterations: 'int' = 15, seed: 'int' = 42, verbose: 'int' = 0, cg_iters: 'int' = 10, use_cholesky: 'bool' = False, use_eals: 'bool' = False, eals_iters: 'int' = 1, anderson_m: 'int' = 0, popularity_weighting: 'str' = 'none', use_biases: 'bool' = False, alpha_view: 'float' = 10.0, view_target: 'float' = 0.5, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors. |
| regularization | float | L2 regularization weight. |
| alpha | float | Confidence scaling: confidence = 1 + alpha * r. |
| iterations | int | Number of ALS outer iterations. |
| seed | int | Random seed. |
| cg_iters | int | Conjugate Gradient iterations per user/item solve (ignored when use_cholesky=True). Reduce to 3 for very large datasets. |
| use_cholesky | bool | Use a direct Cholesky solve instead of iterative CG. Exact solution; faster when users have many interactions relative to factors. |
| use_eals | bool | Use element-wise ALS (eALS). Usually faster than Cholesky/CG and less memory intensive. |
| eals_iters | int | Number of inner iterations for eALS (default 1). |
| anderson_m | int | History window for Anderson Acceleration of the outer ALS loop (default 0 = disabled; recommended: 5). ALS is a fixed-point iteration (U, V) → F(U, V); Anderson mixing extrapolates over the last m residuals to reach the fixed point faster, typically reducing the number of outer iterations by 30–50% at identical recommendation quality. Example: ALS(iterations=10, cg_iters=3, anderson_m=5) matches the quality of the baseline ALS(iterations=15, cg_iters=3) while being ~2.5× faster. Memory overhead: m copies of the full (U ‖ V) matrix (~57 MB per copy at 25M ratings, k=64). |
| popularity_weighting | str | Weighting scheme for missing data in eALS. Items that are frequently interacted-with provide stronger negative signals when not chosen. Options: "none" (uniform, default), "sqrt", "log", "linear". Only used when use_eals=True. |
| use_biases | bool | If True, learn global bias (μ), user biases (b_u), and item biases (b_i) so that prediction becomes μ + b_u + b_i + w_u · h_i. |
| alpha_view | float | Confidence scaling for view interactions in VALS mode. Pass view_matrix to fit() to enable. Default 10.0. |
| view_target | float | Target value for view interactions (between 0.0 and 1.0). Purchases always target 1.0. Default 0.5. |
| use_cuda | bool or None | If True, use CUDA acceleration (CuPy or PyTorch) for batch recommendation; falls back to CPU if no CUDA backend is found. Default None. |
Examples
Fold in a new user without retraining the entire model matrix:
>>> import rusket
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> # Fit model on some data
>>> model = rusket.ALS(factors=8).fit(csr_matrix(np.random.randint(0, 2, size=(10, 20))))
>>> # New user interacts with items 3, 5, and 12
>>> latent_factors = model.recalculate_user([3, 5, 12])
>>> # `latent_factors` is a 1D array of length `factors=8`
ALS.batch_recommend¶
Top-N items for all users efficiently computed in parallel.
ALS.batch_recommend(n: 'int' = 10, exclude_seen: 'bool' = True, format: "Literal['pandas', 'polars', 'spark']" = 'polars') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| n | int, default=10 | The number of items to recommend per user. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
| format | str, default="polars" | The DataFrame format to return. One of "pandas", "polars", or "spark". |
Returns
| Type | Description |
|---|---|
| DataFrame | A DataFrame with columns user_id, item_id, and score. |
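The scoring step behind batch recommendation, in miniature: each user's item scores are the dot products of their factor vector with every item factor vector; seen items are masked out and the top-n of the remainder kept. Pure-Python sketch for intuition; the library parallelizes this in Rust (the helper name is hypothetical):

```python
# Top-N item recommendation for one user from latent factor vectors.

def top_n(user_factors, item_factors, seen, n):
    scores = [
        sum(u * v for u, v in zip(user_factors, item))  # dot product
        for item in item_factors
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in seen][:n]

user = [1.0, 0.0]
items = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # one factor row per item
recs = top_n(user, items, seen={0}, n=2)  # item 0 masked out -> [2, 1]
```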
ALS.build_ann_index¶
Build an Approximate Nearest Neighbor index from item factors.
ALS.build_ann_index(backend: 'str' = 'native', index_type: 'str' = 'hnsw', **kwargs: 'Any') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| backend | str | "native" uses the built-in Rust random-projection forest (:class:~rusket.ApproximateNearestNeighbors). "faiss" uses FAISS (requires pip install faiss-cpu). |
| index_type | str | For "faiss" backend: "flat", "hnsw", "ivfflat", "ivfpq". Ignored for "native" backend. |
| **kwargs | Any | Additional arguments passed to the index builder. |
Returns
| Type | Description |
|---|---|
| index object | A fitted ANN index with a query() / kneighbors() method. |
ALS.fit¶
Fit the model to the user-item interaction matrix.
ALS.fit(interactions: 'Any' = None, *, view_matrix: 'Any' = None) -> 'ALS'
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | sparse matrix or numpy array, optional | If None, uses the matrix prepared by from_transactions(). |
| view_matrix | sparse matrix or numpy array, optional | Optional view/browse interaction matrix (same shape as interactions). When provided, enables VALS mode: views are treated as weaker positive signals with confidence alpha_view targeting view_target. |
Raises
| Exception | Condition |
|---|---|
| RuntimeError | |
| TypeError |
ALS.recalculate_user¶
Calculate the latent factors for a new or existing user given their interacted items.
ALS.recalculate_user(user_items: 'Any') -> 'np.ndarray'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_items | list of int or 1D array-like | The item indices the user has interacted with. If the model was fitted using a DataFrame with item names, these should be the mapped item indices from 0 to n_items - 1. Note: Confidence values for interactions are currently treated as 1. |
Returns
| Type | Description |
|---|---|
| ndarray | A 1D numpy array of shape (factors,) containing the user's latent factors. |
Raises
| Exception | Condition |
|---|---|
| RuntimeError | |
| ValueError |
ALS.recommend_items¶
Top-N items for a user. Set exclude_seen=False to include already-seen items.
ALS.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
ALS.recommend_users¶
Top-N users for an item.
ALS.recommend_users(item_id: 'int', n: 'int' = 10) -> 'tuple[Any, Any]'
eALS¶
Element-wise ALS (eALS) collaborative filtering model.
A convenience wrapper around :class:ALS that sets use_eals=True by default.
eALS updates latent factors element-by-element rather than block-wise, which
is often faster and less memory-intensive for implicit datasets while yielding
comparable or better recommendation quality.
from rusket.recommenders.als import eALS
eALS(*args: 'Any', use_eals: 'bool' = True, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors. |
| regularization | float | L2 regularisation weight. |
| alpha | float | Confidence scaling: confidence = 1 + alpha * r. |
| iterations | int | Number of ALS outer iterations. |
| seed | int | Random seed. |
| eals_iters | int | Number of inner iterations for eALS (default 1). |
| **kwargs | Any | Additional arguments passed to ALS. |
BPR¶
Bayesian Personalized Ranking (BPR) model for implicit feedback.
BPR optimizes for ranking rather than reconstruction error (as ALS does). It samples pairs of a positive item the user has interacted with and a negative item they have not, and adjusts the latent factors so that the positive item scores higher than the negative one.
from rusket.recommenders.bpr import BPR
BPR(factors: 'int' = 64, learning_rate: 'float' = 0.05, regularization: 'float' = 0.01, iterations: 'int' = 150, seed: 'int' = 42, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors (default: 64). |
| learning_rate | float | SGD learning rate (default: 0.05). |
| regularization | float | L2 regularization weight (default: 0.01). |
| iterations | int | Number of passes over the entire interaction dataset (default: 150). |
| seed | int | Random seed for Hogwild! SGD sampling (default: 42). |
| use_cuda | bool | If True, use CUDA acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no CUDA backend found. Default False. |
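The pairwise update described above can be sketched as a single SGD step on the BPR-Opt objective. This is an illustrative toy loop, not the library's Hogwild! Rust kernel; all names are hypothetical:

```python
import numpy as np

def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    """One SGD step: push positive item i above negative item j for user u."""
    x_uij = P[u] @ (Q[i] - Q[j])             # score margin
    g = 1.0 / (1.0 + np.exp(x_uij))          # sigmoid(-x): gradient weight
    P[u] += lr * (g * (Q[i] - Q[j]) - reg * P[u])
    Q[i] += lr * (g * P[u] - reg * Q[i])
    Q[j] += lr * (-g * P[u] - reg * Q[j])

rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(10, 4))      # user factors
Q = rng.normal(scale=0.1, size=(20, 4))      # item factors
before = P[0] @ (Q[1] - Q[2])
for _ in range(50):
    bpr_step(P, Q, u=0, i=1, j=2)
after = P[0] @ (Q[1] - Q[2])
print(after > before)  # the positive-vs-negative margin grows: True
```

In the real trainer the (u, i, j) triples are drawn at random from the interaction matrix each epoch.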
BPR.fit¶
Fit the BPR model to the user-item interaction matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | sparse matrix or numpy array, optional | If None, uses the matrix prepared by from_transactions(). |
Raises
| Exception | Condition |
|---|---|
| RuntimeError | |
| TypeError |
BPR.recommend_items¶
Top-N items for a user.
from rusket.recommenders.bpr import BPR.recommend_items
BPR.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
FM¶
Factorization Machines (FM) context-aware model for predictive tasks (e.g. CTR).
This model supports binary classification tasks using Log Loss (Binary Cross Entropy). Inputs should be formatted as a scipy sparse CSR matrix where features are binary (0/1). Each row is a sample consisting of User, Item, and Context features.
from rusket.recommenders.fm import FM
FM(factors: 'int' = 8, learning_rate: 'float' = 0.05, regularization: 'float' = 0.01, iterations: 'int' = 100, seed: 'int' = 42, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors for the cross terms (default: 8). |
| learning_rate | float | SGD learning rate (default: 0.05). |
| regularization | float | L2 regularization weight (default: 0.01). |
| iterations | int | Number of training epochs (default: 100). |
| seed | int | Random seed for SGD sampling (default: 42). |
| verbose | int | Verbosity level; whether to print training progress (default: 0). |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend is found. Default False. |
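The cross-term trick that makes second-order FMs linear-time is easy to show in numpy. A hedged sketch of the prediction rule only (hypothetical helper; the library's trained weights and kernels are not shown here):

```python
import numpy as np

def fm_predict_proba(X, w0, w, V):
    """sigmoid(w0 + X w + 0.5 * sum_f ((X V)_f^2 - (X^2)(V^2)_f)) per sample."""
    linear = X @ w
    XV = X @ V                                        # (n_samples, factors)
    pairwise = 0.5 * (XV ** 2 - (X ** 2) @ (V ** 2)).sum(axis=1)
    return 1.0 / (1.0 + np.exp(-(w0 + linear + pairwise)))

rng = np.random.default_rng(0)
X = (rng.random((5, 12)) < 0.3).astype(np.float64)    # binary context features
w = rng.normal(size=12)                               # first-order weights
V = rng.normal(scale=0.1, size=(12, 8))               # factor matrix for cross terms
p = fm_predict_proba(X, 0.0, w, V)
print(p.shape)  # (5,)
```

Each row of `X` is one User + Item + Context sample, matching the input format described above.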
FM.fit¶
Fit the FM model to Context-aware Data.
Parameters
| Parameter | Type | Description |
|---|---|---|
| X | scipy.sparse.csr_matrix or numpy array | Sparse binary feature matrix of shape (n_samples, n_features). Each row represents a single interaction with all its context features. |
| y | numpy.ndarray | Binary target labels (0.0 or 1.0) of shape (n_samples,). |
FM.predict¶
Alias for predict_proba.
FM.predict_proba¶
Predict the probability (CTR) of interactions.
Parameters
| Parameter | Type | Description |
|---|---|---|
| X | scipy.sparse.csr_matrix or numpy array | Sparse binary feature matrix of shape (n_samples, n_features). |
Returns
| Name | Type | Description |
|---|---|---|
| numpy.ndarray | Predicted probabilities of shape (n_samples,). |
Recommender¶
Hybrid recommender combining ALS collaborative filtering, semantic similarities, and association rules.
from rusket.recommenders.recommend import Recommender
Recommender(model: 'Any | None' = None, rules_df: 'pd.DataFrame | None' = None, item_embeddings: 'np.ndarray | None' = None)
Recommender.predict_next_chunk¶
Batch-rank the next best products for every user in user_history_df.
from rusket.recommenders.recommend import Recommender.predict_next_chunk
Recommender.predict_next_chunk(user_history_df: 'pd.DataFrame', user_col: 'str' = 'user_id', k: 'int' = 5) -> 'pd.DataFrame'
Recommender.recommend_for_cart¶
Suggest items to add to an active cart using association rules.
from rusket.recommenders.recommend import Recommender.recommend_for_cart
Recommender.recommend_for_cart(cart_items: 'list[int]', n: 'int' = 5) -> 'list[int]'
Recommender.recommend_for_user¶
Top-N recommendations for a user via Hybrid ALS + Semantic.
from rusket.recommenders.recommend import Recommender.recommend_for_user
Recommender.recommend_for_user(user_id: 'int', n: 'int' = 5, alpha: 'float' = 0.5, target_item_for_semantic: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | The user ID to generate recommendations for. |
| n | int, default=5 | Number of items to return. |
| alpha | float, default=0.5 | Weight blending CF vs Semantic. alpha=1.0 is pure CF. alpha=0.0 is pure semantic. |
| target_item_for_semantic | int | None, default=None | If provided, semantic similarity is computed against this item. If None, and alpha < 1.0, it computes semantic similarity against the user's most recently interacted item (if history is available) or falls back to pure CF. |
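The alpha blend is a convex combination of the two score vectors. A minimal illustration of the idea (not the library's internal scoring; the min-max normalization step is an assumption for comparability):

```python
import numpy as np

def blend(cf_scores, semantic_scores, alpha=0.5, n=5):
    """score = alpha * CF + (1 - alpha) * semantic over min-max-normalized scores."""
    def minmax(s):
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    blended = alpha * minmax(cf_scores) + (1 - alpha) * minmax(semantic_scores)
    top = np.argsort(-blended)[:n]
    return top, blended[top]

cf = np.array([0.1, 0.9, 0.4])        # collaborative-filtering scores
sem = np.array([0.8, 0.2, 0.5])       # semantic-similarity scores
ids, scores = blend(cf, sem, alpha=0.5, n=2)
```

With alpha=1.0 the semantic term vanishes and the ranking is pure CF; with alpha=0.0 it is pure semantic.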
NextBestAction¶
Hybrid recommender combining ALS collaborative filtering, semantic similarities, and association rules.
from rusket.recommenders.recommend import NextBestAction
NextBestAction(model: 'Any | None' = None, rules_df: 'pd.DataFrame | None' = None, item_embeddings: 'np.ndarray | None' = None)
NextBestAction.predict_next_chunk¶
Batch-rank the next best products for every user in user_history_df.
from rusket.recommenders.recommend import NextBestAction.predict_next_chunk
NextBestAction.predict_next_chunk(user_history_df: 'pd.DataFrame', user_col: 'str' = 'user_id', k: 'int' = 5) -> 'pd.DataFrame'
NextBestAction.recommend_for_cart¶
Suggest items to add to an active cart using association rules.
from rusket.recommenders.recommend import NextBestAction.recommend_for_cart
NextBestAction.recommend_for_cart(cart_items: 'list[int]', n: 'int' = 5) -> 'list[int]'
NextBestAction.recommend_for_user¶
Top-N recommendations for a user via Hybrid ALS + Semantic.
from rusket.recommenders.recommend import NextBestAction.recommend_for_user
NextBestAction.recommend_for_user(user_id: 'int', n: 'int' = 5, alpha: 'float' = 0.5, target_item_for_semantic: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | The user ID to generate recommendations for. |
| n | int, default=5 | Number of items to return. |
| alpha | float, default=0.5 | Weight blending CF vs Semantic. alpha=1.0 is pure CF. alpha=0.0 is pure semantic. |
| target_item_for_semantic | int | None, default=None | If provided, semantic similarity is computed against this item. If None, and alpha < 1.0, it computes semantic similarity against the user's most recently interacted item (if history is available) or falls back to pure CF. |
EASE¶
Embarrassingly Shallow Autoencoders for Sparse Data (EASE).
An implicit collaborative filtering algorithm that computes a closed-form item-item similarity matrix by solving a ridge regression problem. EASE often achieves state-of-the-art recommendation quality and very fast inference, particularly on datasets with strong item-item correlations.
from rusket.recommenders.ease import EASE
EASE(regularization: 'float' = 500.0, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| regularization | float | L2 regularization weight (lambda). Higher values encourage smaller weights and reduce overfitting. Default is 500.0. |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend is found. Default False. |
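The closed-form solution behind EASE is compact enough to sketch directly. A dense toy version of the ridge-regression weights (the library's implementation is Rust-accelerated and sparse-aware; names here are illustrative):

```python
import numpy as np

def ease_weights(X, lam=500.0):
    """B[i, j] = -P[i, j] / P[j, j] with P = (X^T X + lam*I)^(-1); diag(B) = 0."""
    G = X.T @ X + lam * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                    # divide each column j by -P[j, j]
    np.fill_diagonal(B, 0.0)                 # an item must not explain itself
    return B

X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)       # 3 users x 3 items
B = ease_weights(X, lam=1.0)
scores = X @ B                               # per-user scores for ranking unseen items
```

Inference is then a single sparse matrix product, which is why EASE serves recommendations so quickly.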
EASE.fit¶
Fit the model to the user-item interaction matrix (Rust-accelerated).
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | sparse matrix or numpy array, optional | If None, uses the matrix prepared by from_transactions(). |
EASE.recommend_items¶
Top-N items for a user. Set exclude_seen=False to include already-seen items.
from rusket.recommenders.ease import EASE.recommend_items
EASE.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
ItemKNN¶
Ultra-fast Sparse Item-Item K-Nearest Neighbors Recommender.
Computes an item-item similarity matrix and only retains the top-K neighbors per item. Similarity methods include BM25, TF-IDF, Cosine, or unweighted Count.
from rusket.recommenders.item_knn import ItemKNN
ItemKNN(method: "Literal['bm25', 'tfidf', 'cosine', 'count']" = 'bm25', k: 'int' = 20, bm25_k1: 'float' = 1.2, bm25_b: 'float' = 0.75, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any')
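Before the neighbor search, the interaction matrix is reweighted by the chosen similarity method. A hedged dense sketch of BM25 weighting (a hypothetical helper using the standard BM25 form with the k1/b parameters above, not the library's sparse kernel):

```python
import numpy as np

def bm25_weight(X, k1=1.2, b=0.75):
    """BM25-weight a dense user-item matrix: idf(i) * tf*(k1+1) / (tf + k1*norm)."""
    N = X.shape[0]
    df = (X > 0).sum(axis=0)                           # users who touched each item
    idf = np.log((N - df + 0.5) / (df + 0.5) + 1.0)    # smoothed, always positive
    row_len = X.sum(axis=1, keepdims=True)             # interactions per user
    avg_len = row_len.mean()
    denom = X + k1 * (1 - b + b * row_len / avg_len)
    return idf * (X * (k1 + 1)) / denom

X = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)
W = bm25_weight(X)
```

Cosine similarity over the weighted rows then favors co-occurrence with rare items over globally popular ones.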
ItemKNN.fit¶
Fit the ItemKNN model.
from rusket.recommenders.item_knn import ItemKNN.fit
ItemKNN.fit(interactions: 'Any' = None) -> 'ItemKNN'
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | scipy.sparse.csr_matrix, optional | A sparse matrix of shape (n_users, n_items). If None, uses the matrix prepared by from_transactions(). |
Returns
| Name | Type | Description |
|---|---|---|
| ItemKNN | The fitted model. |
ItemKNN.recommend_items¶
Top-N items for a user.
from rusket.recommenders.item_knn import ItemKNN.recommend_items
ItemKNN.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | The user ID to generate recommendations for. |
| n | int, default=10 | Number of items to return. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) sorted by descending score. |
FPMC¶
Factorizing Personalized Markov Chains (FPMC) model for sequential recommendation.
FPMC combines Matrix Factorization (modeling user preferences) and Markov Chains (modeling sequential transitions between items). It is highly effective for tasks where both personal taste and sequential behavior matter (e.g., next-basket delivery).
from rusket.sequential.fpmc import FPMC
FPMC(factors: 'int' = 64, learning_rate: 'float' = 0.05, regularization: 'float' = 0.01, iterations: 'int' = 150, seed: 'int' = 42, time_aware: 'bool' = False, max_time_steps: 'int' = 256, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors (default: 64). |
| learning_rate | float | SGD learning rate (default: 0.05). |
| regularization | float | L2 regularization weight (default: 0.01). |
| iterations | int | Number of passes over the transitions (default: 150). |
| seed | int | Random seed for sampling (default: 42). |
| verbose | int | Verbosity level; whether to print training progress (default: 0). |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend is found. Default False. |
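FPMC's next-item score is the sum of a user-preference inner product and a last-item transition inner product. A minimal numpy sketch of that scoring rule (illustrative embedding matrices; not the library's trained factors):

```python
import numpy as np

def fpmc_scores(VU, VIU, VL, VIL, user, last_item):
    """score(i) = <v_user, v_i^(item-user)> + <v_last, v_i^(item-last)> for all i."""
    return VIU @ VU[user] + VIL @ VL[last_item]

rng = np.random.default_rng(1)
n_items, f = 50, 8
VU = rng.normal(size=(10, f))        # user embeddings (MF part)
VIU = rng.normal(size=(n_items, f))  # item embeddings paired with users
VL = rng.normal(size=(n_items, f))   # last-item embeddings (Markov-chain part)
VIL = rng.normal(size=(n_items, f))  # item embeddings paired with last items
s = fpmc_scores(VU, VIU, VL, VIL, user=0, last_item=7)
top = np.argsort(-s)[:10]
```

Dropping the first term gives a plain first-order Markov chain; dropping the second gives plain matrix factorization.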
FPMC.fit¶
Fit the FPMC model to a list of sequential interactions.
from rusket.sequential.fpmc import FPMC.fit
FPMC.fit(sequences: 'list[list[int]] | None' = None, timestamps: 'list[list[int]] | None' = None, n_items: 'int | None' = None) -> 'FPMC'
Parameters
| Parameter | Type | Description |
|---|---|---|
| sequences | list of list of int, optional | List of item sequences, where each sequence belongs to a unique user. Users are assigned IDs from 0 to len(sequences)-1. If None, uses data prepared by from_transactions(). |
| timestamps | list of list of int, optional | Corresponding unix timestamps for sequences if time_aware is True. |
| n_items | int | None | Maximum number of items. If None, it is inferred from data. |
FPMC.recommend_items¶
Top-N sequential items for a user.
from rusket.sequential.fpmc import FPMC.recommend_items
FPMC.recommend_items(user_id: 'int', timestamp: 'int | None' = None, n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
SVD¶
Funk SVD collaborative filtering model.
Biased matrix factorization trained with SGD: r̂_ui = μ + b_u + b_i + p_u · q_i
from rusket.recommenders.svd import SVD
SVD(factors: 'int' = 64, learning_rate: 'float' = 0.005, regularization: 'float' = 0.02, iterations: 'int' = 20, seed: 'int' = 42, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Number of latent factors. |
| learning_rate | float | SGD learning rate. |
| regularization | float | L2 regularisation weight. |
| iterations | int | Number of SGD epochs. |
| seed | int | Random seed for reproducibility. |
| verbose | int | Verbosity level (0 = silent, 1+ = progress). |
| use_cuda | bool | If True, use CUDA acceleration (CuPy or PyTorch) for recommendation scoring. Falls back to CPU if no CUDA backend found. Default False. |
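The prediction rule above is short enough to state directly in code. A sketch of Funk SVD inference only (toy parameters, not a trained model):

```python
import numpy as np

def svd_predict(mu, b_u, b_i, P, Q, user_id, item_id):
    """r_hat = mu + b_u + b_i + p_u . q_i  (biased matrix factorization)."""
    return mu + b_u[user_id] + b_i[item_id] + P[user_id] @ Q[item_id]

mu = 3.5                              # global mean rating
b_u = np.array([0.2, -0.1])           # user biases
b_i = np.array([0.3, 0.0, -0.2])      # item biases
P = np.zeros((2, 4))                  # user factors (zeroed for the toy example)
Q = np.zeros((3, 4))                  # item factors
r_hat = svd_predict(mu, b_u, b_i, P, Q, user_id=0, item_id=2)
print(r_hat)  # 3.5 + 0.2 - 0.2 + 0 = 3.5
```

During training, SGD nudges mu-residuals into the biases first and the factor products capture what remains.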
SVD.batch_recommend¶
Top-N items for all users efficiently computed in parallel.
from rusket.recommenders.svd import SVD.batch_recommend
SVD.batch_recommend(n: 'int' = 10, exclude_seen: 'bool' = True, format: 'str' = 'polars') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| n | int | Number of items per user. |
| exclude_seen | bool | Whether to filter already-seen items. |
| format | str | Output format: "polars" or "pandas". |
Returns
| Name | Type | Description |
|---|---|---|
| DataFrame | A DataFrame with columns user_id, item_id, and score. |
SVD.fit¶
Fit the model to the user-item interaction matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | scipy.sparse matrix, np.ndarray, pd.DataFrame, or polars DataFrame, optional | User-item interaction matrix with explicit ratings. If None, uses the matrix prepared by from_transactions(). |
Returns
| Name | Type | Description |
|---|---|---|
| self | SVD | The fitted model. |
SVD.predict¶
Predict the rating for a user-item pair.
from rusket.recommenders.svd import SVD.predict
SVD.predict(user_id: 'int', item_id: 'int') -> 'float'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | User index. |
| item_id | int | Item index. |
Returns
| Name | Type | Description |
|---|---|---|
| float | Predicted rating. |
SVD.recommend_items¶
Top-N items for a user.
from rusket.recommenders.svd import SVD.recommend_items
SVD.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[Any, Any]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | User index. |
| n | int | Number of items to recommend. |
| exclude_seen | bool | Whether to filter already-seen items. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) |
SVD.recommend_users¶
Top-N users for an item.
from rusket.recommenders.svd import SVD.recommend_users
SVD.recommend_users(item_id: 'int', n: 'int' = 10) -> 'tuple[Any, Any]'
LightGCN¶
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation.
A state-of-the-art collaborative filtering model that propagates embeddings over the user–item bipartite graph without non-linear transformations.
Typical training time on ml-100k: < 0.5s/epoch.
from rusket.recommenders.lightgcn import LightGCN
LightGCN(factors: 'int' = 64, k_layers: 'int' = 3, learning_rate: 'float' = 0.001, lambda_: 'float' = 0.0001, ssl_ratio: 'float' = 0.0, ssl_temp: 'float' = 0.2, ssl_weight: 'float' = 0.1, iterations: 'int' = 20, seed: 'int | None' = None, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Embedding dimensionality (latent factors). |
| k_layers | int | Number of graph-propagation layers (1–4). |
| learning_rate | float | Adam learning rate. |
| lambda_ | float | L2 regularization coefficient. |
| iterations | int | Number of training epochs. |
| seed | int or None | Seed for reproducible training. |
| verbose | int | Print training progress. |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend is found. Default False. |
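LightGCN's propagation has no weight matrices or non-linearities: each layer multiplies the embeddings by the symmetric-normalized bipartite adjacency, and the final embedding averages all layers. A dense toy sketch (illustrative only; the library operates on sparse matrices in Rust):

```python
import numpy as np

def lightgcn_embeddings(R, E0, k_layers=3):
    """E^(k+1) = A_hat @ E^(k); return the mean of E^(0..k_layers)."""
    n_users, n_items = R.shape
    A = np.zeros((n_users + n_items, n_users + n_items))
    A[:n_users, n_users:] = R                 # user -> item edges
    A[n_users:, :n_users] = R.T               # item -> user edges
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    layers = [E0]
    for _ in range(k_layers):
        layers.append(A_hat @ layers[-1])
    return np.mean(layers, axis=0)

R = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)    # 2 users x 3 items
E0 = np.random.default_rng(0).normal(size=(5, 4))    # stacked user+item embeddings
E = lightgcn_embeddings(R, E0, k_layers=3)
print(E.shape)  # (5, 4)
```

Scoring is then the inner product between a user row and an item row of the averaged embedding matrix.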
LightGCN.fit¶
Fit the model to a user-item interaction matrix.
from rusket.recommenders.lightgcn import LightGCN.fit
LightGCN.fit(interactions: 'Any' = None) -> 'LightGCN'
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | scipy.sparse.csr_matrix or numpy.ndarray, optional | A sparse or dense user-item interaction matrix. If None, uses data prepared by from_transactions(). |
Returns
| Name | Type | Description |
|---|---|---|
| LightGCN | The fitted model. |
LightGCN.recommend_items¶
Top-N items for a user.
from rusket.recommenders.lightgcn import LightGCN.recommend_items
LightGCN.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | Original user ID (before encoding). |
| n | int, default=10 | Number of recommendations. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) sorted by descending score. |
SASRec¶
SASRec – Self-Attentive Sequential Recommendation.
Applies a causal Transformer to user interaction sequences to predict the next item. Significantly outperforms Markov-chain methods like FPMC on long sequences.
from rusket.sequential.sasrec import SASRec
SASRec(factors: 'int' = 64, n_layers: 'int' = 2, max_seq: 'int' = 50, learning_rate: 'float' = 0.0005, lambda_: 'float' = 0.0001, iterations: 'int' = 20, seed: 'int | None' = None, time_aware: 'bool' = False, max_time_steps: 'int' = 256, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int | Embedding dimensionality. |
| n_layers | int | Number of Transformer blocks. |
| max_seq | int | Maximum input sequence length (older items are dropped). |
| learning_rate | float | SGD learning rate (decays during training). |
| lambda_ | float | L2 regularization. |
| iterations | int | Number of training epochs. |
| seed | int or None | Seed for reproducibility. |
| time_aware | bool | If true, incorporates timestamp deltas into sequential modeling. |
| max_time_steps | int | Maximum number of time bins (e.g. days) to consider for time-awareness. |
| verbose | int | Print epoch progress. |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend found. Default False. |
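The causal Transformer mentioned above restricts attention so that position t can only look at positions up to t. A single-head numpy sketch of that masked attention step (illustrative; the library's trainer adds layer norm, feed-forward blocks, and position embeddings):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """One attention head with a causal mask: position t attends only to <= t."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf                   # block attention to future items
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                    # sequence of 6 item embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

Because of the mask, the first position's output depends only on the first item, which is what makes next-item training valid.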
SASRec.fit¶
from rusket.sequential.sasrec import SASRec.fit
SASRec.fit(sequences: 'list[list[int]] | None' = None, timestamps: 'list[list[int]] | None' = None) -> 'SASRec'
SASRec.recommend_items¶
Top-N items for a user or an ad-hoc sequence.
from rusket.sequential.sasrec import SASRec.recommend_items
SASRec.recommend_items(user_id: 'int | list[int]', timestamps: 'list[int] | None' = None, n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int or list[int] | The ID of the user (implicitly 0 to len(sequences)-1 from fit), or a list of items representing an ad-hoc sequence. |
| timestamps | list[int], optional | Corresponding unix timestamps if user_id is a list of items and time_aware=True. |
| n | int, default=10 | Number of recommendations. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) sorted by descending score. |
PopularityRecommender¶
Recommend items by global popularity (interaction count).
A non-personalised baseline that ranks every item by the total number of interactions it received. Useful as a sanity-check baseline when evaluating more sophisticated models.
from rusket.recommenders.popularity import PopularityRecommender
PopularityRecommender(verbose: 'int' = 0, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| verbose | int, default=0 | Verbosity level. |
PopularityRecommender.fit¶
Fit the model by counting interactions per item.
from rusket.recommenders.popularity import PopularityRecommender.fit
PopularityRecommender.fit(interactions: 'Any' = None) -> 'PopularityRecommender'
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | sparse matrix or numpy array, optional | User-item interaction matrix. If None, uses the matrix prepared by from_transactions(). |
Returns
| Name | Type | Description |
|---|---|---|
| PopularityRecommender | The fitted model. |
PopularityRecommender.recommend_items¶
Return the n most popular items for a user.
from rusket.recommenders.popularity import PopularityRecommender.recommend_items
PopularityRecommender.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | Internal user index. |
| n | int, default=10 | Number of items to return. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) sorted by descending popularity. |
ContentBased¶
Content-based recommender using TF-IDF vectorization and cosine similarity.
Recommends items similar to a given item based on textual features (descriptions, tags, genres, etc.).
from rusket.recommenders.content_based import ContentBased
ContentBased(max_features: 'int' = 5000, ngram_range: 'tuple[int, int]' = (1, 2), **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| max_features | int, default=5000 | Maximum number of TF-IDF features to extract. |
| ngram_range | tuple[int, int], default=(1, 2) | Range of n-grams for TF-IDF vectorisation. |
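The TF-IDF plus cosine-similarity pipeline can be shown end to end on toy data. A whitespace-tokenized, unigram-only sketch (not the library's Rust-accelerated vectorizer, which also supports n-grams and a feature cap):

```python
import numpy as np

def tfidf_cosine_sim(docs):
    """Build TF-IDF vectors over whitespace tokens and return pairwise cosine sims."""
    vocab = sorted({t for d in docs for t in d.split()})
    idx = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d.split():
            tf[i, idx[t]] += 1
    df = (tf > 0).sum(axis=0)
    idf = np.log(len(docs) / df) + 1.0        # smoothed inverse document frequency
    X = tf * idf
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize rows
    return X @ X.T                                   # cosine similarity matrix

S = tfidf_cosine_sim(["action thriller", "action comedy", "romantic comedy"])
```

Here the two "action" titles come out more similar to each other than to the romance, which is exactly the signal recommend_similar ranks by.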
ContentBased.fit¶
Compute TF-IDF vectors and the pairwise cosine similarity matrix (Rust-accelerated).
Returns
| Name | Type | Description |
|---|---|---|
| ContentBased | The fitted model. |
ContentBased.recommend_similar¶
Find the n most similar items to a given item.
from rusket.recommenders.content_based import ContentBased.recommend_similar
ContentBased.recommend_similar(item: 'Any', n: 'int' = 10) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| item | Any | Item ID (as it appeared in item_col of the source DataFrame). |
| n | int, default=10 | Number of similar items to return. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, similarity_scores) sorted by descending similarity. |
HybridRecommender¶
Weighted ensemble of multiple recommendation models.
Blends the output of several pre-fitted models by combining their
recommend_items scores with configurable weights.
from rusket.recommenders.hybrid import HybridRecommender
HybridRecommender(models_and_weights: 'list[tuple[Any, float]]') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| models_and_weights | list[tuple[Any, float]] | List of (model, weight) pairs. Each model must implement recommend_items(user_id, n, exclude_seen) -> (ids, scores). |
HybridRecommender.fit¶
No-op — constituent models must be pre-fitted.
from rusket.recommenders.hybrid import HybridRecommender.fit
HybridRecommender.fit() -> 'HybridRecommender'
HybridRecommender.recommend_items¶
Blend recommendations from all constituent models.
For each model, requests a large candidate set (n * 3), maps item
scores into a shared score vector, applies the weight, and returns the
top-n from the blended result.
from rusket.recommenders.hybrid import HybridRecommender.recommend_items
HybridRecommender.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | Internal user index. |
| n | int, default=10 | Number of items to return. |
| exclude_seen | bool, default=True | Whether to exclude items already seen. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, blended_scores) sorted by descending score. |
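The candidate-expansion and score-mapping steps described above can be sketched with stub models. This is an illustrative stand-in for the blending logic, not the class's actual implementation:

```python
import numpy as np

def blend_models(models_and_weights, user_id, n, n_items):
    """Map each model's (ids, scores) into a shared vector, weight, take top-n."""
    blended = np.zeros(n_items)
    for model, weight in models_and_weights:
        ids, scores = model.recommend_items(user_id, n=n * 3, exclude_seen=True)
        blended[np.asarray(ids)] += weight * np.asarray(scores)
    top = np.argsort(-blended)[:n]
    return top, blended[top]

class Stub:  # stand-in for any fitted model with recommend_items
    def __init__(self, ids, scores):
        self._ids, self._scores = ids, scores
    def recommend_items(self, user_id, n=10, exclude_seen=True):
        return self._ids[:n], self._scores[:n]

m1 = Stub([0, 1, 2], [3.0, 2.0, 1.0])
m2 = Stub([1, 2, 3], [3.0, 2.0, 1.0])
ids, scores = blend_models([(m1, 0.5), (m2, 0.5)], user_id=0, n=2, n_items=4)
print(ids, scores)
```

Item 1 wins here because both stubs rank it highly, which is the point of the weighted ensemble.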
NMF¶
Non-negative Matrix Factorization for collaborative filtering.
Decomposes the user-item interaction matrix R into two non-negative
matrices W (users × factors) and H (factors × items) such that
R ≈ W @ H. The multiplicative update rules guarantee non-negativity
without a projection step.
from rusket.recommenders.nmf import NMF
NMF(factors: 'int' = 64, iterations: 'int' = 100, regularization: 'float' = 0.01, seed: 'int' = 42, verbose: 'int' = 0, use_cuda: 'bool | None' = None, **kwargs: 'Any') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| factors | int, default=64 | Number of latent factors. |
| iterations | int, default=100 | Number of multiplicative update iterations. |
| regularization | float, default=0.01 | L2 regularisation penalty applied to both W and H. |
| seed | int, default=42 | Random seed for initialisation. |
| verbose | int, default=0 | Verbosity level. |
| use_cuda | bool | If True, use GPU acceleration (CuPy or PyTorch) for recommendation. Falls back to CPU if no GPU backend is found. Default False. |
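The multiplicative update rules are short enough to sketch in full. A dense toy trainer (illustrative only; the library's fit is Rust-accelerated and the epsilon guard is an assumption for numerical safety):

```python
import numpy as np

def nmf_fit(R, factors=2, iterations=100, reg=0.01, seed=42):
    """Multiplicative updates: W *= (R H^T)/(W H H^T + reg*W), and similarly for H."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.random((n_users, factors))
    H = rng.random((factors, n_items))
    eps = 1e-9                                 # avoid division by zero
    for _ in range(iterations):
        W *= (R @ H.T) / (W @ H @ H.T + reg * W + eps)
        H *= (W.T @ R) / (W.T @ W @ H + reg * H + eps)
    return W, H

R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)
W, H = nmf_fit(R)
```

Because every factor in the update is non-negative, W and H stay non-negative at every iteration with no projection step, as noted above.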
NMF.fit¶
Fit via multiplicative update rules (Rust-accelerated).
Parameters
| Parameter | Type | Description |
|---|---|---|
| interactions | sparse matrix or numpy array, optional | User-item interaction matrix. If None, uses the matrix prepared by from_transactions(). |
Returns
| Name | Type | Description |
|---|---|---|
| NMF | The fitted model. |
NMF.recommend_items¶
Top-N items for a user via W @ H^T.
from rusket.recommenders.nmf import NMF.recommend_items
NMF.recommend_items(user_id: 'int', n: 'int' = 10, exclude_seen: 'bool' = True) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int | Internal user index. |
| n | int, default=10 | Number of items to return. |
| exclude_seen | bool, default=True | Whether to exclude already-seen items. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) sorted by descending score. |
Analytics & Utilities¶
score_potential¶
Cross-selling potential scores — shape (n_users, n_items) or (n_users, len(target_categories)).
Items the user has already interacted with are masked to -inf.
from rusket.recommenders.recommend import score_potential
score_potential(user_history: 'list[list[int]]', model: 'Any', target_categories: 'list[int] | None' = None) -> 'np.ndarray'
similar_items¶
Find the most similar items to a given item ID based on latent factors.
Computes cosine similarity between the specified item's latent vector
and all other item vectors in the item_factors matrix.
from rusket._internal.similarity import similar_items
similar_items(model: 'SupportsItemFactors', item_id: 'int', n: 'int' = 5) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | SupportsItemFactors | A fitted model instance with an item_factors property. |
| item_id | int | The internal integer index of the target item. |
| n | int | Number of most similar items to return. |
Returns
| Name | Type | Description |
|---|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, cosine_similarities) sorted in descending order. |
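The cosine computation described above is a one-liner over the factor matrix. A minimal sketch (hypothetical helper that mirrors the documented signature, with the query item masked out of its own results):

```python
import numpy as np

def similar_items(item_factors, item_id, n=5):
    """Cosine similarity between one item's latent vector and all others."""
    norms = np.linalg.norm(item_factors, axis=1)
    sims = (item_factors @ item_factors[item_id]) / (norms * norms[item_id] + 1e-12)
    sims[item_id] = -np.inf                    # exclude the query item itself
    top = np.argsort(-sims)[:n]
    return top, sims[top]

F = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])                     # toy item_factors
ids, sims = similar_items(F, item_id=0, n=2)
print(ids)  # nearest neighbour of item 0 is item 1
```

Any fitted model exposing item_factors can be queried this way.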
find_substitutes¶
Substitute/cannibalizing products via negative association rules.
Items with high individual support but low co-occurrence (lift < 1.0) likely cannibalize each other.
from rusket._internal.analytics import find_substitutes
find_substitutes(rules_df: 'pd.DataFrame', max_lift: 'float' = 0.8) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| rules_df | pd.DataFrame | DataFrame output from rusket.association_rules. |
| max_lift | float | Upper bound for lift; lift < 1.0 implies negative correlation. |
Returns
| Name | Type | Description |
|---|---|---|
| substitutes | pd.DataFrame | Sorted ascending by lift (most severe cannibalization first). |
customer_saturation¶
Customer saturation by unique items/categories bought, split into deciles.
from rusket._internal.analytics import customer_saturation
customer_saturation(df: 'pd.DataFrame', user_col: 'str', category_col: 'str | None' = None, item_col: 'str | None' = None) -> 'pd.DataFrame'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | Interaction DataFrame. |
| user_col | str | Column identifying the user. |
| category_col | str or None | Category column (optional; at least one of category/item is required). |
| item_col | str or None | Item column (optional). |
Returns
| Name | Type | Description |
|---|---|---|
| saturation | pd.DataFrame | DataFrame with unique_count, saturation_pct, and decile columns. |
export_item_factors¶
Exports latent item factors as a DataFrame for Vector DBs.
This format is ideal for ingesting into FAISS, Pinecone, or Qdrant for Retrieval-Augmented Generation (RAG) and semantic search.
from rusket.export.factors import export_item_factors
export_item_factors(model: 'SupportsItemFactors', include_labels: 'bool' = True, normalize: 'bool' = False, format: 'str' = 'pandas') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | SupportsItemFactors | A fitted model instance with an item_factors property. |
| include_labels | bool, default=True | Whether to include the string item labels (if available from the model's fitting method). |
| normalize | bool, default=False | Whether to L2-normalize the factors before export. |
| format | str, default="pandas" | The DataFrame format to return. One of "pandas", "polars", or "spark". |
Returns
| Name | Type | Description |
|---|---|---|
| Any | A DataFrame where each row is an item with columns item_id, optionally item_label, and vector (a dense 1-D numpy array of the item's latent factors). |
Examples
>>> model = rusket.ALS(factors=32).fit(interactions)
>>> df = rusket.export_item_factors(model)
>>> # Ingest into FAISS / Pinecone / Qdrant
>>> vectors = np.stack(df["vector"].values)
PCA¶
Principal Component Analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition
of the centred data, computed entirely in Rust via the faer crate.
Parameters
| Parameter | Type | Description |
|---|---|---|
| n_components | int | Number of principal components to keep. |
Attributes (available after fit())
| Attribute | Type | Description |
|---|---|---|
| components_ | np.ndarray | Principal axes in feature space, shape (n_components, n_features). |
| explained_variance_ | np.ndarray | Variance explained per component (uses n - 1 degrees of freedom). |
| explained_variance_ratio_ | np.ndarray | Fraction of total variance explained per component. |
| singular_values_ | np.ndarray | Singular values corresponding to each component. |
| mean_ | np.ndarray | Per-feature empirical mean estimated from the training data. |
| n_components_ | int | Number of components that were actually fitted (may be less than requested if n_components > min(n_samples, n_features)). |
Examples
>>> import numpy as np
>>> import rusket
>>> X = np.random.default_rng(42).standard_normal((100, 10)).astype(np.float32)
>>> pca = rusket.PCA(n_components=3)
>>> pca.fit(X)
PCA(n_components=3)
>>> pca.transform(X).shape
(100, 3)
>>> pca.explained_variance_ratio_.sum() # close to fraction of total
0.4...
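The attributes listed above follow the standard SVD construction. As a sketch, the same quantities can be computed with plain numpy (this mirrors the definitions, not rusket's Rust internals):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10)).astype(np.float32)
n_components = 3

# mean_ is the per-feature empirical mean; PCA operates on centred data.
mean_ = X.mean(axis=0)
Xc = X - mean_

# Thin SVD of the centred matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components_ = Vt[:n_components]            # shape (n_components, n_features)
singular_values_ = S[:n_components]
explained_variance_ = S[:n_components] ** 2 / (X.shape[0] - 1)  # n - 1 dof
total_var = Xc.var(axis=0, ddof=1).sum()
explained_variance_ratio_ = explained_variance_ / total_var

# transform(X) projects the centred data onto the principal axes.
X_new = Xc @ components_.T
print(X_new.shape)  # (100, 3)
```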
PCA.fit¶
Fit PCA on the data matrix X.
Parameters
| Parameter | Type | Description |
|---|---|---|
| X | array-like of shape (n_samples, n_features) | Training data. |
Returns
| Name | Type | Description |
|---|---|---|
| self | PCA | The fitted estimator. |
PCA.fit_transform¶
Fit the model with X and apply dimensionality reduction.
from rusket.viz.pca import PCA
PCA.fit_transform(X: 'npt.NDArray[Any]') -> 'npt.NDArray[np.float32]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| X | array-like of shape (n_samples, n_features) | Training data. |
Returns
| Name | Type | Description |
|---|---|---|
| X_new | ndarray of shape (n_samples, n_components) | The transformed training data. |
PCA.transform¶
Apply dimensionality reduction to X.
from rusket.viz.pca import PCA
PCA.transform(X: 'npt.NDArray[Any]') -> 'npt.NDArray[np.float32]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| X | array-like of shape (n_samples, n_features) | The data to project onto the fitted components. |
Returns
| Name | Type | Description |
|---|---|---|
| X_new | ndarray of shape (n_samples, n_components) | The projected data. |
Pipeline¶
Multi-stage recommendation pipeline.
Composes multiple recommendation models into a retrieve → rerank → filter funnel, following the architecture used by production recommendation systems at Twitter/X, YouTube, and Spotify.
from rusket.evaluation.pipeline import Pipeline
Pipeline(retrieve: 'Any | list[Any] | None' = None, rerank: 'Any | None' = None, rules: 'Any | list[Any] | None' = None, filter: 'Callable[[list[Any], list[float]], tuple[list[Any], list[float]]] | None' = None, merge_strategy: "Literal['max', 'mean', 'sum']" = 'max') -> 'None'
Parameters
| Parameter | Type | Description |
|---|---|---|
| retrieve | list or single model | One or more ImplicitRecommender instances used for candidate generation. Each model's recommend_items() is called and results are merged. |
| rerank | model, optional | An ImplicitRecommender used to re-score the merged candidate set. Typically a heavier model (e.g. BPR or LightGCN) that produces higher-quality rankings on a smaller candidate pool. |
| rules | model or list, optional | One or more RuleBasedRecommender instances. Rules are evaluated for the user's history and injected into the candidate set after re-ranking, with an artificially boosted score (e.g., +1,000,000) to ensure they always surface at the very top of the final recommendations. |
| filter | callable, optional | A function (item_ids, scores) -> (filtered_ids, filtered_scores) applied at the very end. Use for block lists, category restrictions, recency filters, NSFW removal, etc. |
| merge_strategy | {'max', 'mean', 'sum'}, default='max' | How to combine scores when multiple retrievers return the same item. |
Examples
>>> pipeline = Pipeline(
... retrieve=[als, item_knn],
... rerank=bpr,
... rules=my_curated_rules,
... filter=lambda ids, sc: (
... [i for i in ids if i not in blocked_set],
... [s for i, s in zip(ids, sc) if i not in blocked_set],
... ),
... )
>>> items, scores = pipeline.recommend(user_id=42, n=10)
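How merge_strategy combines scores when several retrievers return the same item can be sketched in plain Python. This is a simplified stand-in, not the library's implementation:

```python
from collections import defaultdict

def merge_candidates(retriever_results, strategy="max"):
    """Combine (item_ids, scores) pairs from several retrievers."""
    per_item = defaultdict(list)
    for items, vals in retriever_results:
        for item, val in zip(items, vals):
            per_item[item].append(val)
    reduce = {"max": max, "sum": sum, "mean": lambda v: sum(v) / len(v)}[strategy]
    merged = {item: reduce(vals) for item, vals in per_item.items()}
    # Sort by descending combined score, as the pipeline would.
    return sorted(merged.items(), key=lambda kv: -kv[1])

# Two retrievers both surface item 7 with different scores.
a = ([7, 3], [0.9, 0.5])
b = ([7, 8], [0.4, 0.7])
print(merge_candidates([a, b], "max"))  # [(7, 0.9), (8, 0.7), (3, 0.5)]
```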
Pipeline.recommend¶
Run the full pipeline for a single user.
from rusket.evaluation.pipeline import Pipeline
Pipeline.recommend(user_id: 'int | Any', n: 'int' = 10, exclude_seen: 'bool' = True, retrieve_k: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray]'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_id | int or any | The user to generate recommendations for. |
| n | int, default=10 | Number of final items to return. |
| exclude_seen | bool, default=True | Whether to exclude items the user has already interacted with. |
| retrieve_k | int, optional | Number of candidates per retriever. Defaults to n * 10 to produce a wide candidate pool for re-ranking. |
Returns
| Type | Description |
|---|---|
| tuple[np.ndarray, np.ndarray] | (item_ids, scores) arrays, sorted by descending score. |
Pipeline.recommend_batch¶
Batch recommendations for multiple users.
Uses the Rust-accelerated fast path when all models expose
user_factors / item_factors and share the same user indexing.
Falls back to the Python per-user loop otherwise.
from rusket.evaluation.pipeline import Pipeline
Pipeline.recommend_batch(user_ids: 'list[int | Any] | np.ndarray | None' = None, n: 'int' = 10, exclude_seen: 'bool' = True, retrieve_k: 'int | None' = None, format: 'str' = 'pandas') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| user_ids | list or array, optional | Users to score. If None, uses all users from the first retriever. |
| n | int, default=10 | Items per user. |
| exclude_seen | bool, default=True | Whether to exclude items users have already interacted with. |
| retrieve_k | int, optional | Candidates per retriever (default: n * 10). |
| format | str, default='pandas' | Output format: 'pandas', 'polars', or 'records'. |
Returns
| Type | Description |
|---|---|
| DataFrame or list of dicts | Columns: user_id, item_ids, scores. |
Visualization (rusket.viz)¶
Graph and visualization utilities. Requires networkx (pip install networkx).
rusket.viz.to_networkx¶
Convert a Rusket association rules DataFrame into a NetworkX Directed Graph.
Nodes represent individual items. Directed edges represent rules
(antecedent → consequent). Edge weights are set by the edge_attr
parameter (typically lift or confidence).
This is extremely useful for running community detection algorithms (e.g., Louvain, Girvan-Newman) to automatically discover product clusters, or for visualising cross-selling patterns as a force-directed graph.
from rusket.viz.plots import to_networkx
rusket.viz.to_networkx(rules_df: 'pd.DataFrame', source_col: 'str' = 'antecedents', target_col: 'str' = 'consequents', edge_attr: 'str' = 'lift') -> 'networkx.DiGraph'
Parameters
| Parameter | Type | Description |
|---|---|---|
| rules_df | pd.DataFrame | A Pandas DataFrame generated by rusket.association_rules(). |
| source_col | str, default='antecedents' | Column name containing antecedents (graph edge sources). |
| target_col | str, default='consequents' | Column name containing consequents (graph edge targets). |
| edge_attr | str, default='lift' | The metric to use as edge weight/thickness. |
Returns
| Type | Description |
|---|---|
| networkx.DiGraph | A directed graph of the association rules. If rules_df is empty, returns an empty DiGraph. |
Notes
Requires the networkx package (pip install networkx). When multiple rules produce the same directed edge, only the highest-weight rule is retained.
Examples
>>> import rusket
>>> G = rusket.viz.to_networkx(rules_df, edge_attr="lift")
>>> # Community detection with networkx
>>> import networkx.algorithms.community as nx_comm
>>> communities = nx_comm.greedy_modularity_communities(G.to_undirected())
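The note above — only the highest-weight rule survives per directed edge — can be sketched without networkx. The rule rows here are hand-built dicts mimicking an association-rules frame, not real library output:

```python
def best_edges(rules, weight_key="lift"):
    """Keep the highest-weight rule for each (source, target) pair."""
    edges = {}
    for rule in rules:
        for src in rule["antecedents"]:
            for dst in rule["consequents"]:
                w = rule[weight_key]
                if (src, dst) not in edges or w > edges[(src, dst)]:
                    edges[(src, dst)] = w
    return edges

rules = [
    {"antecedents": ["bread"], "consequents": ["butter"], "lift": 1.2},
    {"antecedents": ["bread", "milk"], "consequents": ["butter"], "lift": 2.5},
]
# Both rules map bread -> butter; only the lift=2.5 edge is kept.
print(best_edges(rules)[("bread", "butter")])  # 2.5
```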
Distributed Spark API (rusket.spark)¶
All functions in rusket.spark distribute computation across PySpark partitions using Apache Arrow (zero-copy) for maximum throughput.
rusket.spark.mine_grouped¶
Distribute Market Basket Analysis across PySpark partitions.
This function groups a PySpark DataFrame by group_col and applies
rusket.mine to each group concurrently across the cluster.
It assumes the input PySpark DataFrame is formatted like a dense boolean matrix (One-Hot Encoded) per group, where rows are transactions.
from rusket.integrations.spark import mine_grouped
rusket.spark.mine_grouped(df: 'Any', group_col: 'str', min_support: 'float' = 0.5, max_len: 'int | None' = None, method: 'str' = 'fpgrowth', use_colnames: 'bool' = True) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pyspark.sql.DataFrame | The input PySpark DataFrame. |
| group_col | str | The column to group by (e.g. store_id). |
| min_support | float, default=0.5 | Minimum support threshold. |
| max_len | int, optional | Maximum itemset length. |
| method | str, default='fpgrowth' | Algorithm to use: 'fpgrowth' or 'eclat'. |
| use_colnames | bool, default=True | If True, returns item names instead of column indices. Must be True for PySpark applyInArrow schema consistency. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | A PySpark DataFrame containing: group_col, support (float), itemsets (array of strings). |
rusket.spark.rules_grouped¶
Distribute Association Rule Mining across PySpark partitions.
This takes the frequent itemsets DataFrame (output of mine_grouped)
and applies association_rules uniformly across the groups.
from rusket.integrations.spark import rules_grouped
rusket.spark.rules_grouped(df: 'Any', group_col: 'str', num_itemsets: 'dict[Any, int] | int', metric: 'str' = 'confidence', min_threshold: 'float' = 0.8) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pyspark.sql.DataFrame | The PySpark DataFrame containing frequent itemsets. |
| group_col | str | The column to group by. |
| num_itemsets | dict or int | A dictionary mapping group IDs to their total transaction count, or a single integer if all groups have the same number of transactions. |
| metric | str, default='confidence' | The metric to filter by (e.g. "confidence", "lift"). |
| min_threshold | float, default=0.8 | The minimal threshold for the evaluation metric. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | A DataFrame containing antecedents, consequents, and all rule metrics, prepended with the group_col. |
rusket.spark.prefixspan_grouped¶
Distribute Sequential Pattern Mining (PrefixSpan) across PySpark partitions.
This function groups a PySpark DataFrame by group_col and applies
PrefixSpan.from_transactions to each group concurrently across the cluster.
from rusket.integrations.spark import prefixspan_grouped
rusket.spark.prefixspan_grouped(df: 'Any', group_col: 'str', user_col: 'str', time_col: 'str', item_col: 'str', min_support: 'int' = 1, max_len: 'int | None' = None) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pyspark.sql.DataFrame | The input PySpark DataFrame. |
| group_col | str | The column to group by (e.g. store_id). |
| user_col | str | The column identifying the sequence within each group (e.g., user_id or session_id). |
| time_col | str | The column used for ordering events within a sequence. |
| item_col | str | The column containing the items. |
| min_support | int, default=1 | The minimum absolute support (number of sequences a pattern must appear in). |
| max_len | int, optional | Maximum length of the sequential patterns to mine. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | A PySpark DataFrame containing: group_col, support (long/int64), sequence (array of strings). |
rusket.spark.hupm_grouped¶
Distribute High-Utility Pattern Mining (HUPM) across PySpark partitions.
This function groups a PySpark DataFrame by group_col and applies
HUPM.from_transactions to each group concurrently across the cluster.
from rusket.integrations.spark import hupm_grouped
rusket.spark.hupm_grouped(df: 'Any', group_col: 'str', transaction_col: 'str', item_col: 'str', utility_col: 'str', min_utility: 'float', max_len: 'int | None' = None) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pyspark.sql.DataFrame | The input PySpark DataFrame. |
| group_col | str | The column to group by (e.g. store_id). |
| transaction_col | str | The column identifying the transaction within each group. |
| item_col | str | The column containing the numeric item IDs. |
| utility_col | str | The column containing the numeric utility (e.g., profit) of the item in the transaction. |
| min_utility | float | The minimum total utility required to consider a pattern "high-utility". |
| max_len | int, optional | Maximum length of the itemsets to mine. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | A PySpark DataFrame containing: group_col, utility (double/float64), itemset (array of longs/int64). |
rusket.spark.recommend_batches¶
Distribute Batch Recommendations across PySpark partitions.
This function uses mapInArrow to process partitions of users concurrently,
applying a pre-fitted Recommender (or ALS) to each chunk.
from rusket.integrations.spark import recommend_batches
rusket.spark.recommend_batches(df: 'Any', model: 'Any', user_col: 'str' = 'user_id', k: 'int' = 5) -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pyspark.sql.DataFrame | The PySpark DataFrame containing user histories (must contain user_col). |
| model | Recommender or ALS | The pre-trained model instance to use for scoring. |
| user_col | str, default='user_id' | The column identifying the user. |
| k | int, default=5 | The number of top recommendations to return per user. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | A DataFrame with two columns: user_col and recommended_items (array of longs/int64). |
rusket.spark.to_spark¶
Convert a Pandas or Polars DataFrame into a PySpark DataFrame.
from rusket.integrations.spark import to_spark
rusket.spark.to_spark(spark_session: 'Any', df: 'Any') -> 'Any'
Parameters
| Parameter | Type | Description |
|---|---|---|
| spark_session | SparkSession | The active PySpark SparkSession. |
| df | pd.DataFrame or pl.DataFrame | The DataFrame to convert. |
Returns
| Type | Description |
|---|---|
| pyspark.sql.DataFrame | The resulting PySpark DataFrame. |