
Building the "transactions" Class for Association Rule Mining in SparkR using arules and apriori

I am using SparkR and trying to convert a "SparkDataFrame" to "transactions" in order to mine associations of items/products.

I have found a similar example at https://blog.aptitive.com/building-the-transactions-class-for-association-rule-mining-in-r-using-arules-and-apriori-c6be64268bc4, but it only applies if you are working with an R data.frame. I currently have my data in this format:

CUSTOMER_KEY_h PRODUCT_CODE

    1   SAVE
    1   CHEQ
    1   LOAN
    1   LOAN
    1   CARD
    1   SAVE
    2   CHEQ
    2   LOAN
    2   CTSAV
    2   SAVE
    2   CHEQ
    2   SAVE
    2   CARD
    2   CARD
    3   LOAN
    3   CTSAV
    4   SAVE
    5   CHEQ
    5   SAVE
    5   CARD
    5   LOAN
    5   CARD
    6   CHEQ
    6   CHEQ

and would like to end up with something like this:

CUSTOMER_KEY_h  PRODUCT_CODE
    1          {SAVE, CHEQ, LOAN, LOAN, CARD, SAVE}
    2          {CHEQ, LOAN, CTSAV, SAVE, CHEQ, SAVE, CARD, CARD}
    3          {LOAN, CTSAV}
    4          {SAVE}
    5          {CHEQ, SAVE, CARD, LOAN, CARD}
    6          {CHEQ, CHEQ}

Alternatively, if I could get the SparkR equivalent of this R script, that would be helpful: df2 <- apply(df, 2, as.logical)

The arules package is not compatible with SparkR. If you want to explore association rules on Spark, you should use Spark's own utilities. First, use collect_set to combine the records for each customer:

library(SparkR)
library(magrittr)

sparkR.session()  # start a Spark session if one is not already running

df <- createDataFrame(data.frame(
  CUSTOMER_KEY_h = c(
    1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 6),
  PRODUCT_CODE = c(
    "SAVE","CHEQ","LOAN","LOAN","CARD","SAVE","CHEQ","LOAN","CTSAV","SAVE",
    "CHEQ","SAVE","CARD","CARD","LOAN","CTSAV","SAVE","CHEQ","SAVE","CARD","LOAN",
    "CARD","CHEQ","CHEQ")
))

# collect_set deduplicates items within each basket, which Spark's
# FP-growth requires (it rejects transactions with repeated items)
baskets <- df %>% 
  groupBy("CUSTOMER_KEY_h") %>% 
  agg(alias(collect_set(column("PRODUCT_CODE")), "items"))
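
To sanity-check the grouped result before modelling, you can print it with showDF (an optional step, not part of the pipeline):

# Each row now holds one customer's set of distinct products
showDF(baskets, truncate = FALSE)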

Fit the model (see the spark.fpGrowth docs for the full list of available options):

fpgrowth <- spark.fpGrowth(baskets)
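
For example, the support and confidence thresholds can be set explicitly, and the frequent itemsets inspected directly (the threshold values below are illustrative, not tuned for this data):

# Illustrative thresholds; SparkR's defaults are minSupport = 0.3, minConfidence = 0.8
fpgrowth <- spark.fpGrowth(baskets, minSupport = 0.2, minConfidence = 0.5)

# Frequent itemsets found by the model
spark.freqItemsets(fpgrowth) %>% head()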

Then use the fitted model to extract the association rules:

arules <- spark.associationRules(fpgrowth)

arules %>% head()
        antecedent consequent confidence lift                                   
1       CARD, LOAN       SAVE          1  1.5
2       CARD, LOAN       CHEQ          1  1.5
3 LOAN, SAVE, CHEQ       CARD          1  2.0
4       SAVE, CHEQ       LOAN          1  1.5
5       SAVE, CHEQ       CARD          1  2.0
6       CARD, SAVE       LOAN          1  1.5
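
The fitted model can also suggest additional items per basket via predict (a minimal sketch reusing the baskets frame built above):

# Append the consequents implied by the mined rules to each basket
predictions <- predict(fpgrowth, baskets)
predictions %>% head()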

If you use Spark < 2.3.0, you can try replacing:

alias(collect_set(column("PRODUCT_CODE")), "items")

with

expr("collect_set(PRODUCT_CODE) AS items")
