Sequential Pattern Mining (PrefixSpan)
In standard Market Basket Analysis, we look at the items inside a single checkout. However, if we want to model lifecycle purchasing or churn behavior, we need an algorithm that natively understands time.
In this cookbook, we will mine sequential patterns using rusket's blazing-fast PrefixSpan implementation over an e-commerce clickstream log.
1. The E-Commerce Event Log
We start with a classic log of distinct user events over time. This could be page views, checkout events, or support tickets.
import pandas as pd

events = pd.DataFrame(
{
"user_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
"timestamp": [
"2024-01-01 10:00",
"2024-01-05 10:05",
"2024-01-10 10:10",
"2024-01-02 11:00",
"2024-01-07 11:05",
"2024-01-03 09:00",
"2024-01-04 09:05",
"2024-01-09 09:10",
"2024-01-01 12:00",
"2024-01-08 12:00",
"2024-01-15 12:00",
],
"event_name": [
"signup",
"view_product",
"add_to_cart",
"signup",
"view_product",
"signup",
"view_product",
"checkout",
"view_product",
"checkout",
"churn",
],
}
)
# Ensure correct temporal ordering
events["timestamp"] = pd.to_datetime(events["timestamp"])
events.sort_values(["user_id", "timestamp"], inplace=True)
events.head()
| | user_id | timestamp | event_name |
|---|---|---|---|
| 0 | 1 | 2024-01-01 10:00:00 | signup |
| 1 | 1 | 2024-01-05 10:05:00 | view_product |
| 2 | 1 | 2024-01-10 10:10:00 | add_to_cart |
| 3 | 2 | 2024-01-02 11:00:00 | signup |
| 4 | 2 | 2024-01-07 11:05:00 | view_product |
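Before compiling, it is worth sanity-checking the per-user ordering. The pure-pandas sketch below (illustrative only, not part of the rusket API) rebuilds the log from above and lists each user's events chronologically:

```python
import pandas as pd

# Rebuild the event log from above (same rows, compact construction)
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00", "2024-01-05 10:05", "2024-01-10 10:10",
                "2024-01-02 11:00", "2024-01-07 11:05",
                "2024-01-03 09:00", "2024-01-04 09:05", "2024-01-09 09:10",
                "2024-01-01 12:00", "2024-01-08 12:00", "2024-01-15 12:00",
            ]
        ),
        "event_name": [
            "signup", "view_product", "add_to_cart",
            "signup", "view_product",
            "signup", "view_product", "checkout",
            "view_product", "checkout", "churn",
        ],
    }
).sort_values(["user_id", "timestamp"])

# One chronological event list per user
per_user = events.groupby("user_id")["event_name"].apply(list)
print(per_user.loc[1])  # ['signup', 'view_product', 'add_to_cart']
```

Because `groupby` preserves the within-group row order of the sorted frame, each list is already in temporal order.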
2. Compiling the Sequential Database
rusket requires the data grouped into one discrete array of integer-encoded events per user. The sequences_from_event_log helper automatically converts your pandas DataFrame into this zero-copy format.
from rusket import sequences_from_event_log

sequences, label_mapping = sequences_from_event_log(
events, user_col="user_id", time_col="timestamp", item_col="event_name"
)
print(f"Compiled {len(sequences)} distinct user sequences.")
print(f"Internal Mapping Table: {label_mapping}")
Compiled 4 distinct user sequences.
Internal Mapping Table: {0: 'signup', 1: 'view_product', 2: 'add_to_cart', 3: 'checkout', 4: 'churn'}
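Conceptually, the compilation boils down to integer-encoding the events and grouping them per user. The pure-pandas sketch below mirrors that logic for illustration only; the real helper hands rusket a zero-copy integer representation rather than Python lists:

```python
import pandas as pd

# The event log from step 1, already sorted by (user_id, timestamp)
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
        "event_name": [
            "signup", "view_product", "add_to_cart",
            "signup", "view_product",
            "signup", "view_product", "checkout",
            "view_product", "checkout", "churn",
        ],
    }
)

# Integer-encode events in order of first appearance...
codes, labels = pd.factorize(events["event_name"])
label_mapping = dict(enumerate(labels))

# ...then collect one integer sequence per user, preserving row order
sequences = events.assign(code=codes).groupby("user_id")["code"].apply(list).tolist()

print(sequences)      # [[0, 1, 2], [0, 1], [0, 1, 3], [1, 3, 4]]
print(label_mapping)  # {0: 'signup', 1: 'view_product', ...}
```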
3. Mining Sequential Patterns
Now we pass our compiled sequences into the prefixspan model. We will ask for patterns that occur for at least 2 distinct users.
import time

from rusket import prefixspan

# Mine patterns
t0 = time.time()
patterns_df = prefixspan(sequences, min_support=2)
print(f"Found {len(patterns_df)} sequential patterns in {time.time() - t0:.4f}s!")
# Restore the human-readable labels from our internal `label_mapping`
patterns_df["event_path"] = patterns_df["sequence"].apply(lambda seq: " → ".join([label_mapping[idx] for idx in seq]))
# Display the most frequent sequences
patterns_df.sort_values("support", ascending=False)[["support", "event_path"]]
Found 5 sequential patterns in 0.0013s!
| | support | event_path |
|---|---|---|
| 0 | 4 | view_product |
| 1 | 3 | signup |
| 2 | 3 | signup → view_product |
| 3 | 2 | view_product → checkout |
| 4 | 2 | checkout |
Using these sequential outputs, businesses can automatically map the 'Happy Path' to checkout versus the 'Failure Path' that leads to churn.
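As an illustration, the mined table can be split into those two paths with plain pandas string filters. The DataFrame below is hand-copied from the output above:

```python
import pandas as pd

# Hand-copied from the mined output above, for illustration
patterns = pd.DataFrame(
    {
        "support": [4, 3, 3, 2, 2],
        "event_path": [
            "view_product",
            "signup",
            "signup → view_product",
            "view_product → checkout",
            "checkout",
        ],
    }
)

# 'Happy Path': sequences ending in a checkout
happy = patterns[patterns["event_path"].str.endswith("checkout")]

# 'Failure Path': sequences containing churn. Empty on this toy data,
# since only one user churned and min_support was 2.
failure = patterns[patterns["event_path"].str.contains("churn")]

print(happy["event_path"].tolist())  # ['view_product → checkout', 'checkout']
```

On a real log with a lower relative support threshold, the churn filter would surface the event paths that most often precede churn.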