
Saving pyspark dataframe after being aggregated with groupBy as csv file

I am learning pyspark and I am a bit confused about how to save a grouped dataframe as a csv file (assuming that for some reason, e.g. RAM limitations, I don't want to convert it to a Pandas dataframe first).

For a reproducible example:

import seaborn as sns
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local') \
.appName('Data cleaning') \
.getOrCreate()
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
from pyspark.sql.functions import *

mpg= sns.load_dataset('mpg')
mpg_sp = spark.createDataFrame(mpg)
mpg_grp = mpg_sp.groupBy('model_year', 'origin').avg('displacement', 'weight')

# The command below fails in the sense that it creates a folder with multiple files in it rather than a single csv file as I would expect

mpg_grp.write.csv('mpg_grp.csv')

# By applying the collect method I get a list which can not be saved as a csv file

mpg_grp1 = mpg_grp.collect()
type(mpg_grp1)  # list

The other answer is correct, but the result of using it is not great.
Of course you can use repartition(1) or coalesce(1), but that will transfer all your data to a single worker and greatly slow down your code.
To avoid this, I would suggest partitioning the data on one of the columns of your dataset, and then writing simple code to get one file per partition:

cols = ["$name"]
mpg_grp.repartition(cols).write.partitionBy(cols).csv("$location")

Thus, the data will be partitioned between workers by one of your columns, and you will get exactly one file per partition (by date, for example).
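
As a concrete sketch of that approach applied to the mpg_grp dataframe from the question (the partitioning column 'model_year' and the output path 'mpg_grp_by_year' are illustrative choices, not anything mandated by the answer):

cols = ["model_year"]                  # illustrative partitioning column
(mpg_grp
    .repartition(*cols)                # shuffle so all rows for one model_year land in one task
    .write
    .partitionBy(*cols)                # one sub-folder per model_year value
    .option("header", True)
    .mode("overwrite")
    .csv("mpg_grp_by_year"))           # illustrative output path

Each model_year=XX sub-folder should then contain exactly one CSV part file.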

Spark is a distributed framework, so output spread over several files is normal behavior: each worker writes its own part, which results in several small files.
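
One quick way to see this (assuming the mpg_grp dataframe from the question) is to check how many partitions the dataframe has; each non-empty partition becomes one part file in the output folder:

print(mpg_grp.rdd.getNumPartitions())  # number of partitions, and roughly the number of part files written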

You can trick the system a bit using this command:

mpg_grp.coalesce(1).write.csv('mpg_grp.csv')

This will write only 1 file (but still inside a folder named 'mpg_grp.csv').
Caution: it may be quite slow.
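
If you really need one flat file named mpg_grp.csv, a common workaround (a sketch that assumes a local filesystem and paths of my own choosing, not HDFS/S3) is to write into a temporary folder with coalesce(1) and then move the single part file out:

import glob, shutil

mpg_grp.coalesce(1).write.option("header", True).mode("overwrite").csv("mpg_grp_tmp")
part_file = glob.glob("mpg_grp_tmp/part-*.csv")[0]   # the lone part file Spark produced
shutil.move(part_file, "mpg_grp.csv")                # rename it to the desired file name
shutil.rmtree("mpg_grp_tmp")                         # clean up the now-empty Spark output folder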
