
How to efficiently export association rule generated using pyspark in .CSV or .XLSX file in python

After resolving this issue: How to limit FPGrowth itemesets to just 2 or 3, I am trying to export the association rule output of FPGrowth using pyspark to a .csv file in Python. After running for almost 8-10 hours, it gives an error. My machine has enough space and memory.

    The association rule output looks like this:

    Antecedent           Consequent      Lift
    ['A','B']              ['C']           1

The code is in the linked question: How to limit FPGrowth itemesets to just 2 or 3. I am just adding one more line:

    ar = ar.coalesce(24)
    ar.write.csv('/output', header=True)

Configuration used:

    conf = SparkConf().setAppName("App")
    conf = (conf.setMaster('local[*]')
            .set('spark.executor.memory', '200G')
            .set('spark.driver.memory', '700G')
            .set('spark.driver.maxResultSize', '400G')) #8,45,10
    sc = SparkContext.getOrCreate(conf=conf)
    spark = SparkSession(sc)

This keeps on running and has consumed 1000 GB of my C:/ drive.

Is there any efficient way to save the output in .CSV or .XLSX format?

The error is:

Py4JJavaError: An error occurred while calling o207.csv.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:664)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 9.0 failed 1 times, most recent failure: Lost task 10.0 in stage 9.0 (TID 226, localhost, executor driver): java.io.IOException: There is not enough space on the disk
at java.io.FileOutputStream.writeBytes(Native Method)



     The progress:
     19/07/15 14:12:32 WARN TaskSetManager: Stage 1 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     19/07/15 14:12:33 WARN TaskSetManager: Stage 2 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     19/07/15 14:12:38 WARN TaskSetManager: Stage 4 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
     [Stage 5:>                (0 + 24) / 24][Stage 6:>                 (0 + 0) / 24][I 14:14:02.723 NotebookApp] Saving file at /app1.ipynb
     [Stage 5:==>              (4 + 20) / 24][Stage 6:===>              (4 + 4) / 24]

Like already stated in the comments, you should try to avoid toPandas(), as this function loads all your data to the driver. You can use pyspark's DataFrameWriter to write out your data, but you have to cast your array columns (antecedent and consequent) to a different format before you can write your data to csv, as arrays aren't supported. One way to cast your columns to a supported type like string is concat_ws.

import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
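# Cast each integer array to an array of strings and join its elements with '-',
# so the antecedent/consequent columns become plain strings the CSV writer can handle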
ar=model.associationRules.withColumn('antecedent', F.concat_ws('-', F.col("antecedent").cast("array<string>")))\
                         .withColumn('consequent', F.concat_ws('-', F.col("consequent").cast("array<string>")))
ar.show()

Output:

+----------+----------+------------------+----+ 
|antecedent|consequent|        confidence|lift| 
+----------+----------+------------------+----+ 
|         5|         1|               1.0| 1.0| 
|         5|         2|               1.0| 1.0| 
|       1-2|         5|0.6666666666666666| 1.0| 
|       5-2|         1|               1.0| 1.0| 
|       5-1|         2|               1.0| 1.0| 
|         2|         1|               1.0| 1.0| 
|         2|         5|0.6666666666666666| 1.0| 
|         1|         2|               1.0| 1.0| 
|         1|         5|0.6666666666666666| 1.0| 
+----------+----------+------------------+----+

You can now write your data to csv:

ar.write.csv('/bla', header=True)

This will create a csv file for each partition. You can change the number of partitions with:

ar = ar.coalesce(1)

If Spark is not able to write the csv file due to memory issues, try a different number of partitions (before you call ar.write) and concatenate the files with other tools if necessary, for example with a small script like the sketch below.
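As a rough illustration of that last step (not part of the original answer), here is a minimal sketch that merges the part files Spark wrote into the output directory into a single CSV. The paths '/bla' and 'rules.csv' are just placeholders, and it assumes the files were written with header=True as above:

import glob
import shutil

# Collect the part files Spark produced in the output directory
part_files = sorted(glob.glob('/bla/part-*.csv'))

with open('rules.csv', 'wb') as out:
    for i, path in enumerate(part_files):
        with open(path, 'rb') as part:
            if i > 0:
                # Every part file starts with the same header line; keep only the first one
                part.readline()
            shutil.copyfileobj(part, out)

If the combined result is small enough to fit in memory, you could also read it with pandas afterwards and save it as .xlsx via DataFrame.to_excel.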
