

Using FP-Growth algorithm in Python to determine the most frequent pattern

I have used the FP-Growth algorithm in Python using the mlxtend.frequent_patterns fpgrowth library. I followed the code mentioned on their page and generated the rules, which I feel are recursive. I have formed a DataFrame using those rules. Now I am trying to calculate support and lift using loops, but it is taking a lot of time, which I find inefficient.

The code I have used is as follows:

import pyfpgrowth

# Build a list of transactions from the first 13748 rows and 12 columns of the DataFrame
records = []
for i in range(0, 13748):
    records.append([str(df.values[i, j]) for j in range(0, 12)])

# Mine frequent patterns with a minimum support count of 10
patterns = pyfpgrowth.find_frequent_patterns(records, 10)

# Generate association rules with a minimum confidence of 0.8
rules = pyfpgrowth.generate_association_rules(patterns, 0.8)


def support_count(rhs):
    # Count how many transactions contain every item of the consequent
    count = 0
    rhs = set(rhs)
    for j in data_item['Items']:
        j = set(j)
        if rhs.issubset(j):
            count = count + 1
    return count


# Support of each rule's consequent = (transactions containing it) / (total transactions)
rhs_support = []
for i in df_r['Consequent']:
    a = support_count(i)
    rhs_support.append(a / len(data_item))

Is there any easier way to calculate support and lift using FP-Growth?

These calculations require a lot of computation and can be slow on large data sets. One of the best ways to address this is simply to run as many of these calculations in parallel as you can. Your local machine may not be sufficient to provide the speed you are looking for.
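For example, even on a single machine you can spread the per-consequent support counts across CPU cores with the standard library. This is only a rough sketch of that idea; it reworks your support_count to take the transactions explicitly, and the function and variable names are illustrative:

```python
from multiprocessing import Pool


def support_count(rhs, transactions):
    # Count transactions that contain every item of the consequent
    rhs = set(rhs)
    return sum(1 for t in transactions if rhs <= set(t))


def rhs_support_parallel(consequents, transactions, workers=4):
    # Each support count is independent, so fan them out over several processes
    with Pool(workers) as pool:
        counts = pool.starmap(
            support_count, [(c, transactions) for c in consequents]
        )
    n = len(transactions)
    return [c / n for c in counts]


# Hypothetical usage, mirroring the question's df_r and data_item names
# (on Windows/macOS, call this under an `if __name__ == "__main__":` guard):
# rhs_support = rhs_support_parallel(df_r['Consequent'], list(data_item['Items']))
```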

If you have access to cloud computing, I would recommend using pySpark to achieve your goals. The SparkML library has FPGrowth built in, and I have used it to build a production recommendation system that processes millions of transactions across about half a million products; the entire process takes about 20 minutes, including all of the metrics you are asking for. This is of course using a rather large cluster, with about 200 cores total, so your own performance is going to be proportional to the amount of compute you are willing to pay for.
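A minimal sketch of what that looks like with pyspark.ml, assuming a Spark session and a DataFrame with an array-typed items column (the column names and sample data here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-example").getOrCreate()

# Each row is one transaction represented as an array of item strings
transactions = spark.createDataFrame(
    [(0, ["milk", "bread"]), (1, ["milk", "eggs"]), (2, ["bread", "eggs", "milk"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.1, minConfidence=0.8)
model = fp.fit(transactions)

# Frequent itemsets with their counts; divide freq by the transaction count for support
model.freqItemsets.show()

# Association rules; recent Spark versions expose confidence and lift columns directly
model.associationRules.show()
```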

In any case, if you have never tried it before, I'd recommend looking into DataBricks on the Azure platform. You can get a free trial, and the code to implement FPGrowth is very simple.

FPGrowth in SparkML

DataBricks
