
R crashes when cspade is trained on a large data set

The code below works to extract sequences using the cspade algorithm.

library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
         eventID = c(100, 101, 102, 103, 104),
         site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
         sequence = c(1, 2, 1, 2, 3))

df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
seq <- cspade(df.trans, parameter = list(support = 0.2), 
          control = list(verbose = TRUE))
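
For reference, the mined sequences can then be inspected with the package's standard accessors (summary and the coercion to data.frame are both part of arulesSequences):

summary(seq)            # overview: number of sequences, item frequencies
as(seq, "data.frame")   # one row per frequent sequence, with its support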

The problem is that my actual data set is ~2 million rows, with the sequence value increasing to ~20 for each person. Using the code above, cspade quickly consumes all RAM and R crashes. Does anyone have tips on how to perform sequence mining on large data sets like mine? Thanks!

How many unique IDs do you have in df$sequence? Judging from the last column of your sample data set, there are only 3 sequence values. Do you think sequences of up to 20 events are really necessary? One thing you could do is set the maxlen parameter in your cspade call to something like 4 or 5 and evaluate your predictive accuracy, assuming that's what you're after.

So you would have something like:

seq <- cspade(df.trans, parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE))
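
If memory is still the limiting factor, cspade's control list also accepts memsize and numpart options (from the package's SPcontrol class) that cap the memory used by the mining step and split the database into more partitions, trading speed for a smaller footprint. A sketch combining them with maxlen follows; the specific values (128 MB, 8 partitions) are only guesses you would need to tune against your data:

# memsize is in MB; memsize and numpart values here are illustrative guesses
seq <- cspade(df.trans,
              parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE, memsize = 128, numpart = 8))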

Hope that helps.
