
Odd results from cSPADE in R (arulesSequences) w/ large data. Can I force numpart to 1? Are there risks?

I've been trying to run cSPADE on a dataset with ~7 million records in my transactions file (7 million unique sequenceID x eventID pairs). The support values I get on this dataset seem completely wrong. However, when I use ~86,000 records (more or less the head of the same file), the results look right. I've noticed that up to that size the verbose log reports only 1 partition, whereas with ~850,000 records 3 partitions are used.

Verbose output when using 100,000 records (with reasonable looking results):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 1 partition(s), 1.98 MB [0.7s]
mining transactions ... 0 MB [0.21s]
reading sequences ... [0.03s]

total elapsed time: 0.94s

> summary(s1)
set of 14 sequences with

most frequent items:
      A       B       C       D       E (Other) 
      2       2       1       1       1       8 

.
.
.
summary of quality measures:
    support      
 Min.   :0.1306  
 1st Qu.:0.3701  
 Median :0.7021  
 Mean   :0.5773  
 3rd Qu.:0.7184  
 Max.   :0.9903  

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans         83686      10059     0.1

Verbose output when using 1,000,000 records (with wrong-looking results):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 3 partition(s), 19.55 MB [4.6s]
mining transactions ... 0 MB [0.6s]
reading sequences ... [0.01s]

total elapsed time: 5.19s

> summary(s1)

set of 0 sequences with

most frequent items:
integer(0)

most frequent elements:
integer(0)

element (sequence) size distribution:
< table of extent 0 >

sequence length distribution:
< table of extent 0 >

summary of quality measures:
< table of extent 0 >

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans        826830      96238     0.1

I found that I can set the number of partitions to 1 when calling cSPADE, and that fixed the problem. However, cSPADE then emits a warning:

s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE,numpart=1))

Warning message:
In cspade(trans, parameter = list(support = 0.1, maxlen = 1), control = list(verbose = TRUE,  :
  'numpart' less than recommended

Do I need to heed this warning? What are the downsides of setting numpart=1 (forcing the number of partitions to 1)? If there are any, is there a way to get correct results without setting this parameter?

For the benefit of others who may run into the same problem: I ended up emailing the author of the package. He said this was not a known issue and suggested that I stick to numpart=1.
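
Because the runs above use maxlen=1, every reported sequence is a single item, so its support (the fraction of sequences that contain the item at least once) can be cross-checked without cSPADE. The following is only a minimal sketch in base R. The `events` data frame and its column names are hypothetical stand-ins for the real transactions and do not come from the original post.

```r
# Hypothetical long-format data: one row per (sequenceID, eventID, item).
# Replace this with your own data; the column names are assumptions.
events <- data.frame(
  sequenceID = c(1, 1, 1, 2, 2, 3, 3, 4),
  eventID    = c(1, 2, 3, 1, 2, 1, 2, 1),
  item       = c("A", "B", "A", "A", "C", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Number of distinct sequences (the denominator of the support).
n_sequences <- length(unique(events$sequenceID))

# Keep one row per (sequence, item) so repeated items count only once.
item_by_seq <- unique(events[, c("sequenceID", "item")])

# Support of each item = number of sequences containing it / total sequences.
support <- sapply(split(item_by_seq$sequenceID, item_by_seq$item), length) /
  n_sequences

# Apply the same threshold that was passed to cspade (support = 0.1).
frequent <- support[support >= 0.1]
print(sort(frequent, decreasing = TRUE))
```

If the values computed this way match what cspade returns with numpart=1 but not what it returns with the default partitioning, that would point to the partitioned run as the source of the discrepancy.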
