R - association rules - apriori

I'm running the apriori algorithm like this:

library(arules)
rules <- apriori(dt)
inspect(rules)

where dt is my data.frame with this format:

> head(dt)
   Cus T C B
1:  C1 0 1 1
2:  C2 0 1 0
3:  C3 0 1 0
4:  C4 0 1 0
5:  C5 0 1 0
6:  C6 0 1 1

The idea of the data set is to capture the customer and whether he/she bought three different items (T, C and B) on a particular purchase. For example, based on the information above, we can see that C1 bought C and B, customers C2 to C5 bought only C, and customer C6 bought C and B.

The output is the following:

   lhs      rhs   support confidence      lift
1  {}    => {T=0}    0.90  0.9000000 1.0000000
2  {}    => {C=1}    0.91  0.9100000 1.0000000
3  {B=0} => {T=0}    0.40  0.8163265 0.9070295
4  {B=0} => {C=1}    0.40  0.8163265 0.8970621
5  {B=1} => {T=0}    0.50  0.9803922 1.0893246
6  {B=1} => {C=1}    0.51  1.0000000 1.0989011

My questions are:

1) How can I get rid of rules where T, C or B are equal to 0? If you think about it, the rule {B=0} => {T=0} or even {B=1} => {T=0} doesn't really make sense.

2) I was reading about the apriori algorithm, and in most of the examples each line represents an actual transaction, so in my case it should be something like:

C, B
C
C
C
C
C, B

instead of my sets of ones and zeros. Is that a rule? Or can I still work with my format?

Thanks

I'm not sure what the aim of the program is supposed to be, but the Apriori algorithm does two things. First, it extracts the frequent itemsets from the given data, where a frequent itemset is a set of items that often appear together in the data. Second, it generates association rules from those extracted frequent itemsets. An association rule looks like this, for example:

B -> C

In the stated case this means that customers who bought B also buy C with a certain probability. That probability is governed by the support and confidence levels of the Apriori algorithm: the support level controls how many frequent itemsets are found, and the confidence level controls how many association rules are kept. Association rules whose confidence reaches the confidence level are called strong association rules.

Against this backdrop, I don't understand why the Apriori algorithm is used to determine whether a customer bought different articles. That could be answered by an if statement. The provided output also makes no sense in this context. The third line, for example, says that if a customer does not buy B then he does not buy T, with a support of 40% and a confidence of 81.6%. Apart from the fact that association rules do not have a support, only the association rule B -> C is correct, but its confidence value is wrong.

Nevertheless, if the aim is to generate the association rules described above, the original Apriori algorithm cannot work on input in this format:

> head(dt)
  Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1

The uncustomized Apriori algorithm needs the data set in this format:

> head(dt) 
C1: {B, C} 
C2: {C}
C3: {C} 
C4: {C} 
C5: {C} 
C6: {B, C}
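
As a rough illustration of that reformatting step, the sketch below turns the 0/1 columns into one item list per customer using base R only. The data frame is retyped from the six rows shown in the question, and the names `items` and `baskets` are made up for this example. A list like `baskets` is, as far as I know, something the arules package can coerce with as(baskets, "transactions"), but that call is not exercised here.

```r
# Toy data retyped from the six rows shown in the question.
dt <- data.frame(
  Cus = paste0("C", 1:6),
  T   = c(0, 0, 0, 0, 0, 0),
  C   = c(1, 1, 1, 1, 1, 1),
  B   = c(1, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Item columns, i.e. everything except the customer id (hypothetical name).
items <- c("B", "C", "T")

# For each row keep only the names of the columns that contain a 1.
baskets <- lapply(seq_len(nrow(dt)), function(i) {
  items[unlist(dt[i, items]) == 1]
})
names(baskets) <- dt$Cus

# Print one line per customer in the "C1: {B, C}" layout used above.
for (cus in names(baskets)) {
  cat(cus, ": {", paste(baskets[[cus]], collapse = ", "), "}\n", sep = "")
}
```

Because only the columns holding a 1 survive the conversion, items such as T=0 or B=0 never become part of a basket, so they cannot show up in rules mined from it.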

I see two solutions: either reformat the input beforehand, or customize the Apriori algorithm to accept this format, which would arguably amount to changing the input format inside the algorithm. To clarify why the stated input format is needed, here is the Apriori algorithm in a nutshell with the provided data:

Support level               = 0.3
Confidence level            = 0.3
Number of customers         = 6

Total number of B's bought  = 2
Total number of C's bought  = 6

Support of B                = 2 / 6 = 0.33 >= 0.3 = support level
Support of C                = 6 / 6 = 1    >= 0.3 = support level
Support of B, C             = 2 / 6 = 0.33 >= 0.3 = support level

-> Frequent itemsets        = {B, C, BC}

-> Association rules        = {B -> C}

Confidence of B -> C        = 2 / 2 = 1 >= 0.3 = confidence level

-> Strong association rules = {B -> C}
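
The arithmetic above can be checked with a few lines of base R. This is only a sketch under the assumption that the six customers shown are the whole data set. The helper names `support` and `confidence` and the hard-coded candidate list are illustrative, not part of any package, and a real Apriori implementation would generate its candidates level by level instead.

```r
# Baskets for the six customers shown above (assumed to be the whole data set).
baskets <- list(
  C1 = c("B", "C"), C2 = "C", C3 = "C",
  C4 = "C", C5 = "C", C6 = c("B", "C")
)
min_support    <- 0.3
min_confidence <- 0.3

# Support of an itemset: share of baskets that contain every item in it.
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Confidence of lhs -> rhs: support of both sides together over support of lhs.
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# Hard-coded candidates; pairs containing T are left out because T is never bought.
candidates <- list(B = "B", C = "C", T = "T", BC = c("B", "C"))
supports   <- vapply(candidates, support, numeric(1))
frequent   <- names(supports)[supports >= min_support]

print(round(supports, 2))
print(frequent)
print(confidence("B", "C"))
print(confidence("C", "B"))
```

With these thresholds the reverse rule C -> B, whose confidence is 2 / 6, also clears the 0.3 confidence level, so it would pass the same filter as B -> C.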

Hope this helps.
