Frequent Itemsets & Association Rules - Apriori Algorithm

I'm trying to understand the fundamentals of the Apriori (Basket) Algorithm for use in data mining.

It's best I explain the complication I'm having with an example:

Here is a transactional dataset:

t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes

The minsup for the above is 0.5 or 50%.

Taking from the above, my number of transactions is clearly 7, meaning that for an itemset to be "frequent" it must appear in at least 4 of the 7 transactions (50% of 7 is 3.5, rounded up to 4). As such, this was my frequent itemset 1:

F1:

Milk = 4
Chicken = 4
Beer = 5
Cheese = 4
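
As a check on the threshold and the F1 counts, here is a minimal Python sketch. The variable names (`transactions`, `min_count`, `f1`) are illustrative only and do not come from any library; rounding the 3.5 up to 4 is an assumption about how the 50% threshold is meant to be applied.

```python
from collections import Counter
from math import ceil

# The seven transactions from the dataset above.
transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]

minsup = 0.5
# Smallest whole number of transactions that reaches 50% of 7 (3.5 rounds up).
min_count = ceil(minsup * len(transactions))

# Count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)

# F1: the single items that meet the minimum count.
f1 = {item: n for item, n in item_counts.items() if n >= min_count}

print(min_count)
print(sorted(f1.items()))
```

Clothes (3 transactions) and Boots (1 transaction) fall below the minimum count, which is why they are missing from F1.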

I then created my candidates for the second pass (C2) and narrowed them down to:

F2:

{Milk, Beer} = 4
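
The C2-to-F2 step can be sketched the same way: pair up the frequent single items, count each pair, and keep the pairs that reach the minimum count. This is only an illustration; `f1_items`, `c2` and `f2` are names chosen here, and the minimum count of 4 is taken from the reasoning above.

```python
from itertools import combinations

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]
min_count = 4
f1_items = ["Beer", "Cheese", "Chicken", "Milk"]

# C2: every unordered pair of frequent single items.
c2 = [frozenset(pair) for pair in combinations(f1_items, 2)]

# Support count of a candidate: transactions that contain all of its items.
c2_counts = {c: sum(1 for t in transactions if c <= t) for c in c2}

# F2: the candidates that meet the minimum count.
f2 = {c: n for c, n in c2_counts.items() if n >= min_count}

for itemset, n in f2.items():
    print(sorted(itemset), n)
```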

This is where I get confused: if I am asked to display all frequent itemsets, do I write down all of F1 and F2, or just F2? F1 to me aren't "sets".

I am then asked to create association rules for the frequent itemsets I have just defined and calculate their "confidence" figures. I get this:

Milk -> Beer = 100% confidence
Beer -> Milk = 80% confidence
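
These two figures follow from the usual definition, confidence(X -> Y) = count(X and Y together) / count(X). A small sketch to verify them, where `support_count` and `confidence` are hypothetical helper names rather than functions from any library:

```python
transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Confidence of the rule x -> y: count(x and y) / count(x)."""
    return support_count(x | y, transactions) / support_count(x, transactions)

milk_to_beer = confidence({"Milk"}, {"Beer"}, transactions)  # 4 / 4
beer_to_milk = confidence({"Beer"}, {"Milk"}, transactions)  # 4 / 5

print(f"Milk -> Beer: {milk_to_beer:.0%}")
print(f"Beer -> Milk: {beer_to_milk:.0%}")
```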

It seems superfluous to put F1's itemsets in here, as they will all have a confidence of 100% regardless and don't actually "associate" anything, which is the reason I am now questioning whether F1 are indeed "frequent".

Itemsets of size 1 are considered frequent if their support meets the threshold. But here you also have to consider the minimal size threshold: if the minimal itemset size in your example is 2, then F1 will not be considered. But if the minimal size is 1, then you have to include F1.

You can take a look here and here for more ideas and examples.

Hope that I helped.

If the minimum support threshold (minsup) is 4/7, then you should include single items in the set of frequent itemsets if they appear in no less than 4 transactions out of 7. So in your example, you should include them:

Milk = 4
Chicken = 4
Beer = 5
Cheese = 4
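
To make "all frequent itemsets" concrete, here is a rough level-wise sketch in the spirit of Apriori that collects the frequent itemsets of every size, single items included. It is a simplified illustration under the assumption that the minimum count is 4, not a reference implementation; the function name `apriori` and its parameters are chosen here for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every frequent itemset, of every size."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that meet the minimum count.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]

result = apriori(transactions, min_count=4)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this dataset the result holds five itemsets: the four single items plus {Milk, Beer}.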

Association rules have the form X ==> Y, where X and Y are disjoint itemsets, and it is generally assumed that X and Y are not empty sets (this is what Apriori assumes). Therefore, you need at least two items to generate an association rule.
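
A short sketch of why single items produce no rules: splitting an itemset into a non-empty X and a non-empty Y is only possible when it has at least two items. The `frequent` dictionary below is filled in by hand from the counts in this thread, and `generate_rules` is a hypothetical helper name.

```python
from itertools import combinations

# Frequent itemsets and their counts, copied from the example in this thread.
frequent = {
    frozenset({"Milk"}): 4,
    frozenset({"Chicken"}): 4,
    frozenset({"Beer"}): 5,
    frozenset({"Cheese"}): 4,
    frozenset({"Milk", "Beer"}): 4,
}

def generate_rules(frequent):
    """Return (X, Y, confidence) for every split of a frequent itemset."""
    rules = []
    for itemset, count in frequent.items():
        # range(1, 1) is empty, so a single item yields no rule at all.
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                y = itemset - x  # disjoint from x and non-empty
                rules.append((x, y, count / frequent[x]))
    return rules

rules = generate_rules(frequent)
for x, y, conf in rules:
    print(f"{sorted(x)} ==> {sorted(y)}: {conf:.0%}")
```

Only {Milk, Beer} contributes rules here, one in each direction.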
