
How to efficiently export association rule generated using pyspark in .CSV or .XLSX file in python

After resolving this issue: How to limit FPGrowth itemesets to just 2 or 3, I am trying to export the association rule output of FPGrowth using pyspark to a .csv file in Python. After running for almost 8-10 hours, it gives an error. My machine has enough space and memory.

    The association rule output looks like this:

    Antecedent           Consequent      Lift
    ['A','B']              ['C']           1

The code is in the linked question: How to limit FPGrowth itemesets to just 2 or 3. I am just adding one more line:

    ar = ar.coalesce(24)
    ar.write.csv('/output', header=True)

Configuration used:

    conf = SparkConf().setAppName("App")
    conf = (conf.setMaster('local[*]')
            .set('spark.executor.memory', '200G')
            .set('spark.driver.memory', '700G')
            .set('spark.driver.maxResultSize', '400G')) #8,45,10
    sc = SparkContext.getOrCreate(conf=conf)
    spark = SparkSession(sc)

This keeps on running and has consumed 1000 GB of my C:/ drive.

Is there any efficient way to save the output in .CSV or .XLSX format?

The error is:

Py4JJavaError: An error occurred while calling o207.csv.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:664)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 9.0 failed 1 times, most recent failure: Lost task 10.0 in stage 9.0 (TID 226, localhost, executor driver): java.io.IOException: There is not enough space on the disk
at java.io.FileOutputStream.writeBytes(Native Method)



     The progress:
     19/07/15 14:12:32 WARN TaskSetManager: Stage 1 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     19/07/15 14:12:33 WARN TaskSetManager: Stage 2 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     19/07/15 14:12:38 WARN TaskSetManager: Stage 4 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     [Stage 5:>                (0 + 24) / 24][Stage 6:>                 (0 + 0) / 24][I 14:14:02.723 NotebookApp] Saving file at /app1.ipynb
     [Stage 5:==>              (4 + 20) / 24][Stage 6:===>              (4 + 4) / 24]

Like already stated in the comments, you should try to avoid toPandas(), as this function loads all your data to the driver. You can use pyspark's DataFrameWriter to write out your data, but you have to cast your array columns (antecedent and consequent) to a different format before you can write your data to csv, as arrays aren't supported. One way to cast your columns to a supported type like string is concat_ws.

import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
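# Cast each integer array to an array of strings and join its elements with '-',
# so the antecedent/consequent columns become plain strings the CSV writer can handle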
ar=model.associationRules.withColumn('antecedent', F.concat_ws('-', F.col("antecedent").cast("array<string>")))\
                         .withColumn('consequent', F.concat_ws('-', F.col("consequent").cast("array<string>")))
ar.show()

Output:

+----------+----------+------------------+----+ 
|antecedent|consequent|        confidence|lift| 
+----------+----------+------------------+----+ 
|         5|         1|               1.0| 1.0| 
|         5|         2|               1.0| 1.0| 
|       1-2|         5|0.6666666666666666| 1.0| 
|       5-2|         1|               1.0| 1.0| 
|       5-1|         2|               1.0| 1.0| 
|         2|         1|               1.0| 1.0| 
|         2|         5|0.6666666666666666| 1.0| 
|         1|         2|               1.0| 1.0| 
|         1|         5|0.6666666666666666| 1.0| 
+----------+----------+------------------+----+

You can now write your data to csv:

ar.write.csv('/bla', header=True)

This will create a csv file for each partition. You can change the number of partitions with:

ar = ar.coalesce(1)

If Spark is not able to write the csv file due to memory issues, try a different number of partitions (before you call ar.write) and concatenate the files with other tools if necessary, for example with a small script like the sketch below.
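As a rough illustration of that last step (not part of the original answer), here is a minimal sketch that merges the part files Spark wrote into the output directory into a single CSV. The paths '/bla' and 'rules.csv' are just placeholders, and it assumes the files were written with header=True as above:

import glob
import shutil

# Collect the part files Spark produced in the output directory
part_files = sorted(glob.glob('/bla/part-*.csv'))

with open('rules.csv', 'wb') as out:
    for i, path in enumerate(part_files):
        with open(path, 'rb') as part:
            if i > 0:
                # Every part file starts with the same header line; keep only the first one
                part.readline()
            shutil.copyfileobj(part, out)

If the combined result is small enough to fit in memory, you could also read it with pandas afterwards and save it as .xlsx via DataFrame.to_excel.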
