
Association rule in R - removing redundant rule (arules)

Assume we have 3 rules:

[1] {A,B,D} -> {C}

[2] {A,B} -> {C}

[3] Whatever it is

Rule [2] is a subset of rule [1] (because rule [1] contains all the items in rule [2]), so rule [1] should be eliminated (because rule [1] is too specific and its information is already included in rule [2]).

I searched the internet and everyone is using this code to remove redundant rules:

subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]

I don't understand how the code works.

After line 2 of the code, subset.matrix will become:

      [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA

The cells in the lower triangle are set to NA, and since rule [2] is a subset of rule [1], the corresponding cell is set to 1. So I have 2 questions:

  1. Why do we have to set the lower triangle to NA? If we do so, how can we check whether rule [2] is a subset of rule [3] or not? (The cell has been set to NA.)

  2. In our case, rule [1] should be the one to be eliminated, but this code eliminates rule [2] instead of rule [1]. (The first cell in column 2 is 1, and according to line 3 of the code the column sum of column 2 is >= 1, so rule [2] will be treated as redundant.)

Any help would be appreciated!

For your code to work you need an interest measure (confidence or lift), and rules.sorted needs to be sorted by that measure. Anyway, the code is horribly inefficient since is.subset() creates a matrix of size n^2, where n is the number of rules. Also, is.subset() for rules merges the RHS and LHS of each rule, which is not correct. So don't worry too much about the implementation details.
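To see why the ordering matters, here is a minimal base-R sketch of the idiom from the question applied to three toy rules. It does not call arules: the item sets, the names, and the hand-built subset matrix are assumptions for illustration, with entry [i, j] set to TRUE when all items of rule i are contained in rule j (the orientation is.subset() is understood to use). The general rule is placed first, as it would be after sorting if its lift or confidence were at least as high.

```r
# Toy rules as merged item sets (LHS and RHS together, as the idiom does).
# Hypothetical ordering: the more general rule ranks first after sorting.
rules_sorted <- list(
  general  = c("A", "B", "C"),       # {A,B}   -> {C}
  specific = c("A", "B", "D", "C"),  # {A,B,D} -> {C}
  other    = c("E", "F")             # an unrelated rule
)

# subset_matrix[i, j] is TRUE when every item of rule i occurs in rule j.
n <- length(rules_sorted)
subset_matrix <- matrix(FALSE, n, n,
                        dimnames = list(names(rules_sorted), names(rules_sorted)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset_matrix[i, j] <- all(rules_sorted[[i]] %in% rules_sorted[[j]])
  }
}

# Blank the diagonal and lower triangle: only rows ABOVE a column, i.e.
# better-ranked rules, may mark that column's rule as redundant.
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA

# A column with at least one TRUE has a better-ranked, more general rule.
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
pruned <- rules_sorted[!redundant]
print(redundant)
print(names(pruned))
```

In this sketch the specific rule is the one that gets flagged. If the two rules were listed in the opposite order, the only TRUE entry would fall in the lower triangle and be blanked, so nothing would be removed; that is why the sort order is essential to the idiom.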

A more efficient way to do this is now implemented as the function is.redundant() in package arules (available in version 1.4-2). This explanation comes from the manual page:

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if

for some X' subset X, conf(X' -> Y) >= conf(X -> Y).

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation, measures other than confidence, e.g. improvement of lift, can be used as well.
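The definition above can be sketched in a few lines of base R. This is only an illustration of the rule, not the arules implementation: the toy rules, their confidence values, and the helper names is_proper_subset and is_redundant_toy are all hypothetical, and only more general rules that are actually present in the toy set are considered.

```r
# Hypothetical toy rules: LHS item sets, RHS item, and made-up confidences.
lhs  <- list(c("A", "B", "D"), c("A", "B"), c("A"))
rhs  <- c("C", "C", "C")
conf <- c(0.80, 0.85, 0.60)

# TRUE when x is a proper subset of y (assumes no duplicated items).
is_proper_subset <- function(x, y) all(x %in% y) && length(x) < length(y)

# Rule i is redundant if some rule j has the same RHS, an LHS that is a
# proper subset of rule i's LHS, and a confidence at least as high.
is_redundant_toy <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    rhs[j] == rhs[i] &&
      is_proper_subset(lhs[[j]], lhs[[i]]) &&
      conf[j] >= conf[i]
  }, logical(1)))
}, logical(1))

print(is_redundant_toy)
```

Here {A,B,D} -> {C} is flagged because {A,B} -> {C} is more general and has a higher confidence, while {A,B} -> {C} is kept because its only more general rule, {A} -> {C}, is less predictive.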

Check out the examples in ?is.redundant.

Remove redundant rules with the arules package.

Run the apriori algorithm:

rules <- apriori(transDat, parameter = list(supp = 0.01, conf = 0.5, target = "rules", maxlen = 3))

Remove the redundant rules:

rules <- rules[!is.redundant(rules)]

Inspect:

arules::inspect(rules)

Create a data frame:

df = data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)), 
rules@quality)
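For a version of these steps that runs without your own transDat, here is a sketch using the Adult transactions data set that ships with arules. The support and confidence thresholds are illustrative choices, not recommendations, and quality() is used as the accessor in place of the rules@quality slot.

```r
library(arules)

# Example transactions bundled with arules (stands in for transDat).
data("Adult")

# Mine rules; thresholds are illustrative assumptions.
rules <- apriori(Adult,
                 parameter = list(supp = 0.5, conf = 0.9, target = "rules"))

# Flag and drop rules that a more general rule already covers.
redundant <- is.redundant(rules)
rules_pruned <- rules[!redundant]

# Collect the remaining rules and their quality measures in a data frame.
df <- data.frame(
  lhs = labels(lhs(rules_pruned)),
  rhs = labels(rhs(rules_pruned)),
  quality(rules_pruned)
)
head(df)
```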

Just check the help for is.redundant() in RStudio. It clearly states the following.

Suppose there are two rules:

rule1: X -> Y with confidence cf1

rule2: X' -> Y with confidence cf2, where X' is a subset of X

rule1 is said to be redundant if rule2 has the same or a higher confidence than rule1, i.e. cf2 >= cf1 (where X' is a subset of X).

That is, if there is a rule whose LHS is a subset of the first rule's LHS and which gives the same RHS with at least as much confidence, then the first rule is said to be redundant.

  1. We set the lower triangle (including the diagonal) to NA so that a rule is not counted as a subset of itself.

  2. There is insufficient information: rules cannot be called redundant on the basis of subsetting alone; the confidence values have to be taken into account.
