
Pyspark FP growth implementation running slow

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.

The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.

Is this the real reason? Is there any solution for this?

I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing the columns isn't a solution either.
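For context, a minimal sketch of the kind of FP Growth setup described here might look like the following (the transactions DataFrame, the basket column name, and the threshold values are illustrative assumptions, not taken from the question):

from pyspark.ml.fpm import FPGrowth

# Assumed: `transactions` is a DataFrame with an array column "basket"
# holding the items of each transaction.
fp = FPGrowth(itemsCol="basket", minSupport=0.01, minConfidence=0.2)
model = fp.fit(transactions)

# Frequent itemsets and association rules produced by the fitted model;
# the slow tasks at the end of the job show up during fit().
model.freqItemsets.show()
model.associationRules.show()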

I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If there are a couple of transactions with far more products than the average, the tree computed by FP Growth blows up. This causes the runtime to increase significantly, and the risk of memory errors is much higher.

Hence, removing outlier transactions with a disproportionate number of products might do the trick.

Hope this helps you out a bit :)

Late answer, but I also had an issue with long FPGrowth wait times, and the above answer really helped. I implemented it as follows to filter out any basket whose size is more than one standard deviation above the mean (this is after the transactions have been grouped):

from pyspark.sql.functions import col, size, mean as _mean, stddev as _stddev

def clean_transactions(df):
    # Add a column with the number of items in each basket
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    # Compute the mean and standard deviation of the basket sizes
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    # Keep only baskets no larger than one standard deviation above the mean
    max_ct = mean + std
    print("--filtering out outliers")
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned
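For completeness, a sketch of how this cleaning step could be wired in before fitting FP Growth (the transactions DataFrame and the parameter values are assumptions, not part of the original answer):

from pyspark.ml.fpm import FPGrowth

# Assumed: `transactions` is the grouped DataFrame with a "basket" array column.
transactions_cleaned = clean_transactions(transactions)
fp = FPGrowth(itemsCol="basket", minSupport=0.01, minConfidence=0.2)
model = fp.fit(transactions_cleaned)
model.associationRules.show()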
