
R crashes when cspade is trained on a large data set

The code below works to extract sequences using the cspade algorithm.

library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
         eventID = c(100, 101, 102, 103, 104),
         site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
         sequence = c(1, 2, 1, 2, 3))

df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
seq <- cspade(df.trans, parameter = list(support = 0.2), 
          control = list(verbose = TRUE))
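
For reference, the mined sequences can then be inspected with the package's standard accessors (summary and the coercion to data.frame are both part of arulesSequences):

summary(seq)            # overview: number of sequences, item frequencies
as(seq, "data.frame")   # one row per frequent sequence, with its support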

The problem is that my actual data set is ~2 million rows, with the sequence value increasing to ~20 for each person. Using the code above, cspade quickly consumes all RAM and R crashes. Does anyone have tips on how to perform sequence mining on large data sets like mine? Thanks!

How many unique IDs do you have in df$sequence? Judging from the last column of your sample data set, there are only 3 sequence values. Do you think sequences of up to 20 events are really necessary? One thing you could do is set the maxlen parameter in your cspade call to something like 4 or 5 and evaluate your predictive accuracy, assuming that's what you're after.

So you would have something like:

seq <- cspade(df.trans, parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE))
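
If memory is still the limiting factor, cspade's control list also accepts memsize and numpart options (from the package's SPcontrol class) that cap the memory used by the mining step and split the database into more partitions, trading speed for a smaller footprint. A sketch combining them with maxlen follows; the specific values (128 MB, 8 partitions) are only guesses you would need to tune against your data:

# memsize is in MB; memsize and numpart values here are illustrative guesses
seq <- cspade(df.trans,
              parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE, memsize = 128, numpart = 8))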

Hope that helps.
