
R: Apriori algorithm does not find any association rules

I generated a dataset with two columns: an ID column identifying a customer and a second column listing his/her active products:

head(df_itemList)

      ID      PRD_LISTE
1     1       A,B,C
3     2       C,D
4     3       A,B
5     4       A,B,C,D,E
7     5       B,A,D
8     6       A,C,D

I only selected customers that own more than one product. In total I have 589,454 rows and there are 16 different products.

Next, I wrote the data.frame to a CSV file like this:

df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)

Then I converted the CSV file into basket format in order to apply the apriori algorithm as implemented in the arules package.

library(arules)  
txn <- read.transactions(file="Basket_List_13-08-2020.csv", 
                         rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)

The summary function yields the following output:

summary(txn)
transactions as itemMatrix in sparse format with
 589455 rows (elements/itemsets/transactions) and
 1737 columns (items) and a density of 0.0005757052 

most frequent items:
                   A,C                    A,B                     C,F                     C,D
                  57894                   32150                   31367                   29434 
                  A,B,C                 (Other) 
                  29035                  409575 

element (itemset/transaction) length distribution:
sizes
     1 
589455 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

includes extended item information - examples:
                                                                             labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D

includes extended transaction information - examples:
  transactionID
1              
2             1
3             3

Now, I tried to run the apriori algorithm:

basket_rules <- apriori(txn, parameter = list(sup = 1e-15, 
                                              conf = 1e-15, minlen = 2, target="rules"))

This is the output:

   Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
       0.01    0.1    1 none FALSE            TRUE       5   1e-15      2     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 0 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

Even with a ridiculously low support and confidence, no rules are generated...

summary(basket_rules)
set of 0 rules

Is this really because of my dataset? Or was there a mistake in my code?

Your summary shows that the data is not read in correctly:

most frequent items:
                   A,C                    A,B                     C,F                     C,D
                  57894                   32150                   31367                   29434 
                  A,B,C                 (Other) 
                  29035                  409575 

It looks like "A,C" is read as a single item, but it should be two items, "A" and "C". The separator is not being applied. I assume that is because of quotation marks in the file. Because every transaction then contains exactly one item (see the length distribution in your summary), no rule with minlen = 2 can exist, whatever the support and confidence thresholds. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (the header) using skip = 1 when you read the transactions.
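A minimal sketch of that fix, assuming a toy data frame in place of df_itemList and a temporary file path (both hypothetical): writing the column without quotes, row names or header leaves one basket per line, which read.transactions can then split on commas. In basket format, cols is meant for a column of transaction IDs, so it is left out here because the file written below contains none. The arules call is guarded so that the base R part runs even where the package is not installed.

```r
# Toy stand-in for df_itemList after the ID column was dropped (hypothetical data).
df_itemList <- data.frame(itemList = c("A,B,C", "C,D", "A,B", "B,A,D"),
                          stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")  # hypothetical path

# quote = FALSE keeps the commas inside itemList usable as separators;
# no row names and no header, so every line is exactly one basket.
write.table(df_itemList, csv_file, quote = FALSE, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Inspect the raw file: each line should be a plain comma-separated basket.
basket_lines <- readLines(csv_file)
print(basket_lines)

# Reading the file back needs arules; skipped if the package is missing.
if (requireNamespace("arules", quietly = TRUE)) {
  txn <- arules::read.transactions(csv_file, format = "basket", sep = ",",
                                   rm.duplicates = TRUE)
  print(arules::itemLabels(txn))  # should list single products, not "A,C"
  print(arules::size(txn))        # number of items per transaction
}
```

If the item labels still contain commas after this step, the file on disk is worth checking with readLines before looking at apriori itself.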

@Michael I am quite positive now that there is something wrong with the .csv file I am reading in. Since others have experienced similar problems, my guess is that this is a common cause of the error. Can you please describe what the .csv file should look like when read in?

When typing data <- read.csv("file.csv", header = TRUE, sep = ",") I get the following data.frame:

X     Prd
1     A
2     A,B
3     B,A
4     B
5     C

Is it correct that, if a customer X has multiple products, these products are all written in a single column? Or should they be written in different columns?

Furthermore, when writing txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1, skip=1) and summary(txn) I see the following problem:

most frequent items:
A             B            C           A,B            B,A
1256          1235         456         235            125

(numbers are chosen randomly)

So the read.transactions function differentiates between A,B and B,A... So I am guessing there is something wrong with the .csv file.
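One way to sidestep the CSV round trip altogether, sketched here with made-up product strings: split each string into its items in memory and coerce the resulting list to transactions. Once the strings are split, "A,B" and "B,A" describe the same basket. The support and confidence thresholds below are illustrative only, and the arules part is guarded so that the base R part runs without the package.

```r
# Hypothetical product strings, mirroring the PRD_LISTE column.
prd <- c("A,B,C", "C,D", "A,B", "B,A", "B,A,D")

# Split each string on commas, trim whitespace, drop repeated items.
item_list <- lapply(strsplit(prd, ",", fixed = TRUE),
                    function(items) unique(trimws(items)))

# As sets of items, "A,B" and "B,A" are the same basket.
same_basket <- setequal(item_list[[3]], item_list[[4]])
print(same_basket)

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  # Coerce the list of item vectors to a transactions object.
  txn <- as(item_list, "transactions")
  print(summary(txn))
  # Illustrative thresholds, not tuned for any real data set.
  rules <- apriori(txn, parameter = list(support = 0.2, confidence = 0.5,
                                         minlen = 2, target = "rules"))
  inspect(rules)
}
```

In the summary, the most frequent items should now be single products and the transaction sizes should be larger than 1.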
