
R - For loop for apriori Algorithm

Today I have a question about running the apriori data mining algorithm inside a for loop. I'm analysing the results of apriori but, as you know, its two main parameters (confidence and support) have to be set beforehand, without knowing the results. This means you sometimes have to try several combinations of parameters to reach a satisfying result. I decided to set up a for loop in R that produces this kind of result:

vector  s  c
x1      y1 z1
x2      y1 z2
x3      y1 z3
x4      y2 z1
x5      y2 z2
x6      y2 z3
...
xn      yn zn

where the column vector (the x values) holds the number of rules created, s is the support parameter (0<=s<=1), and c is the confidence parameter (0<=c<=1). In other words, for each support value combined with each confidence value, I want the number of rules created, all stored in a nice three-column data frame.

Naturally I first tried to find the solution myself. I figured the two parameters should be a pair of sequences so, having no idea how to write a for loop over two sequences, I started from one of my old questions:

for loop with decimals and store results in a vector

I tried to make a simple for loop with only one "moving" parameter and the second one fixed. First of all I created some fake data, handy because it is very small.

# here the data
id <- c("1","1","1","2","2","2","3","3","3")
obj <- c("a", "b", "j", "a", "g","c", "a","k","c")
df <- data.frame(id,obj)

Then a conversion, to make the data digestible for the apriori function of the arules package:

# here the rewritten data
library(arules)
transactions <- as(split(df$obj, df$id), "transactions")
inspect(transactions)

And finally, the function with only one moving parameter, the support:

test <- function(x, y1, y2, y3, z){

  # the sequence for the support
  s <- seq(y1, y2, by = y3)

  # one slot per support value
  my_vector <- numeric(length(s))

  # for loop with moving support (in the seq) and fixed confidence
  for(i in seq_along(s)){
    # this is a small trick to count the rules (rows of the lhs labels),
    # do not know if it is perfect; the count goes in the i-th slot
    my_vector[i] <- nrow(data.frame(
      labels(lhs(apriori(x, parameter = list(supp = s[i], conf = z))))))
  }

  # put the result in a data frame
  data <- data.frame(vector = as.numeric(my_vector), s = as.numeric(s))
  return(data)
}

And here the first application with some result:

# the function applied
test(transactions, 0.01, 0.1, 0.01, 0.1)

# the result (apriori also prints its own log, omitted here)
   vector    s
1      31 0.01
2      31 0.02
3      31 0.03
4      31 0.04
5      31 0.05
6      31 0.06
7      31 0.07
8      31 0.08
9      31 0.09
10     31 0.10

And if you run these two calls directly

apriori(transactions,parameter=list(supp = 0.01, conf = 0.1))
apriori(transactions,parameter=list(supp = 0.1, conf = 0.1))

the results are consistent.
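A note on why every row shows the same count: the toy data has only 3 transactions, so any itemset that occurs at all has a support of at least 1/3, and support thresholds between 0.01 and 0.1 cannot exclude anything. The sketch below repeats the single-parameter experiment over a wider support range so the count has a chance to change. The range seq(0.1, 1, by = 0.1) and the use of length() to count rules are assumptions made for illustration, not part of the original attempt.

```r
library(arules)

# Toy data from the question
df <- data.frame(
  id  = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  obj = c("a", "b", "j", "a", "g", "c", "a", "k", "c"),
  stringsAsFactors = FALSE
)
transactions <- as(split(df$obj, df$id), "transactions")

# Assumed wider range: thresholds now cross 1/3 and 2/3
supports <- seq(0.1, 1, by = 0.1)

# Number of rules for each support value, confidence fixed at 0.1
n_rules <- sapply(supports, function(s) {
  rules <- apriori(
    transactions,
    parameter = list(support = s, confidence = 0.1),
    control   = list(verbose = FALSE)  # suppress apriori's own log
  )
  length(rules)
})

data.frame(vector = n_rules, s = supports)
```

Because raising the support threshold can only remove rules, the counts in this table should never increase from one row to the next.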

Now the difficult part (for me). I would also like the confidence parameter to vary. I studied this a bit:

Including multiple conditions in for-loop

But I hit a big limitation: I cannot see how to apply it. I could vary the first parameter and, for each of its values, vary the second. In that case, if the support varies between 0.01 and 0.1 by 0.01, and so does the confidence, the result should be a data frame of 100 rows.

I also have a technical problem: I am not able to implement this. I know the procedure could be a bit heavy for the machine, but I would like one that is usable in practice.

I would appreciate some help, and thanks in advance for your time.

With dplyr.
First, create a grid of parameters.
Then build a model for each combination of parameters and store it in a list-column (useful for further computations).
Then call length() on each model, which seems to do exactly what you want with your "small trick":

grid <- expand.grid(support = seq(0.01, 0.1, 0.01),
                    confidence = seq(0.01, 0.1, 0.01))
library(dplyr)
res <- 
  grid %>% 
  group_by(support, confidence) %>% 
  do(model = apriori(
    transactions,
    parameter = list(support = .$support, confidence = .$confidence)
  )) %>% 
  mutate(n_rules = length(model)) %>%
  ungroup()

# # A tibble: 100 × 4
#    support confidence       model n_rules
#      <dbl>      <dbl>      <list>   <int>
# 1     0.01       0.01 <S4: rules>      31
# 2     0.01       0.02 <S4: rules>      31
# 3     0.01       0.03 <S4: rules>      31
# 4     0.01       0.04 <S4: rules>      31
# 5     0.01       0.05 <S4: rules>      31
# 6     0.01       0.06 <S4: rules>      31
# 7     0.01       0.07 <S4: rules>      31
# 8     0.01       0.08 <S4: rules>      31
# 9     0.01       0.09 <S4: rules>      31
# 10    0.01       0.10 <S4: rules>      31
# # ... with 90 more rows
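
For readers who prefer the explicit loop form the question started from, the same 100 combinations can also be collected with two nested for loops in base R. This is only a sketch: the names supports, confidences and out are illustrative, and control = list(verbose = FALSE) is added here to suppress apriori's log.

```r
library(arules)

# Toy data from the question
df <- data.frame(
  id  = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  obj = c("a", "b", "j", "a", "g", "c", "a", "k", "c"),
  stringsAsFactors = FALSE
)
transactions <- as(split(df$obj, df$id), "transactions")

supports    <- seq(0.01, 0.1, by = 0.01)
confidences <- seq(0.01, 0.1, by = 0.01)

# Preallocate one row per (support, confidence) pair
n_pairs <- length(supports) * length(confidences)
out <- data.frame(
  vector = numeric(n_pairs),  # number of rules
  s      = numeric(n_pairs),  # support threshold
  c      = numeric(n_pairs)   # confidence threshold
)

row <- 1
for (s in supports) {
  for (cf in confidences) {
    rules <- apriori(
      transactions,
      parameter = list(support = s, confidence = cf),
      control   = list(verbose = FALSE)
    )
    out$vector[row] <- length(rules)
    out$s[row]      <- s
    out$c[row]      <- cf
    row <- row + 1
  }
}

head(out)
```

Unlike the dplyr version, this keeps only the counts; the fitted rule sets are discarded after each pass, which uses less memory but means a model has to be refitted if it is needed later.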

You may want to re-use each model. Since they are all stored in the resulting data frame, that should be convenient.
To examine a single model, you could do for instance:

summary(res$model[res$confidence == 0.03 & res$support == 0.04][[1]])

# set of 31 rules
# 
# rule length distribution (lhs + rhs):sizes
#  1  2  3 
#  6 16  9 
# 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   1.000   2.000   2.000   2.097   3.000   3.000 
# 
# summary of quality measures:
#     support         confidence          lift      
#  Min.   :0.3333   Min.   :0.3333   Min.   :1.000  
#  1st Qu.:0.3333   1st Qu.:0.4167   1st Qu.:1.000  
#  Median :0.3333   Median :1.0000   Median :1.000  
#  Mean   :0.3871   Mean   :0.7419   Mean   :1.387  
#  3rd Qu.:0.3333   3rd Qu.:1.0000   3rd Qu.:1.500  
#  Max.   :1.0000   Max.   :1.0000   Max.   :3.000  
# 
# mining info:
#          data ntransactions support confidence
#  transactions             3    0.04       0.03
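
One caveat about the selection above: it compares decimals produced by seq() using ==, which depends on exact floating-point equality and is not guaranteed to hold for every value in a grid. A more defensive sketch matches within a small tolerance instead. The helper name match_params and the tolerance 1e-8 are assumptions chosen for illustration.

```r
# Hypothetical helper: TRUE for grid rows whose parameters match within tol
match_params <- function(grid, support, confidence, tol = 1e-8) {
  abs(grid$support - support) < tol &
    abs(grid$confidence - confidence) < tol
}

# Same parameter grid as above
grid <- expand.grid(support    = seq(0.01, 0.1, by = 0.01),
                    confidence = seq(0.01, 0.1, by = 0.01))

hit <- match_params(grid, support = 0.04, confidence = 0.03)
grid[hit, ]
```

Since res has the same support and confidence columns, the same helper could be used as res$model[match_params(res, 0.04, 0.03)][[1]].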
