
Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and the other would include the rest of the rows.

I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the csv file using pandas.

What would be the most efficient way to do this?

Thanks for your help!

Let's assume you've got a Dataset called "df". You can:

Option one: write it twice:

df.write.option("header", "false").csv(...)           # data only, no header
df.limit(1).write.option("header", "true").csv(...)   # header; limit(1) keeps it a DataFrame (take(1) would return a list of Rows)
# as far as I remember, someone had problems with saving a DataFrame without any rows ->
# you must write at least one row and then manually cut that row out using the normal Java or Python file API

Or, option two: write once with header = true, then manually cut the header out and place it in a new file using the normal Java (or Python) file API - a sketch of this in Python follows below.
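A minimal Python sketch of that second option, assuming a single output part file; the paths and the use of coalesce(1) are assumptions, not part of the original answer:

import glob

# write once, with the header included; coalesce(1) keeps everything in one part file (assumed here)
df.coalesce(1).write.option("header", "true").csv("/tmp/full_output")

# locate the single part file Spark produced inside the output directory
part_file = glob.glob("/tmp/full_output/part-*.csv")[0]

# split it into a header-only file and a data-only file using the normal Python file API
with open(part_file) as src, \
        open("/tmp/header.csv", "w") as header_out, \
        open("/tmp/data.csv", "w") as data_out:
    header_out.write(next(src))   # the first line is the header
    for line in src:              # the remaining lines are data rows
        data_out.write(line)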

Data, without header:

df.to_csv("filename.csv", header=False)  # write the data rows only, suppressing the header line

Header, without data:

import pandas as pd

df_new = pd.DataFrame(data=None, columns=df.columns)  # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("filename.csv")  # writes only the header line (use a different file name so the data file is not overwritten)
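If the starting point is a Spark DataFrame rather than a pandas one, a small sketch of the bridge; spark_df and the file names are hypothetical, and this only works if the data fits in driver memory:

import pandas as pd

# collect the Spark DataFrame to the driver as a pandas DataFrame (only viable for data that fits in memory)
pdf = spark_df.toPandas()

pdf.to_csv("data.csv", header=False, index=False)                    # data rows only, no header
pd.DataFrame(columns=pdf.columns).to_csv("header.csv", index=False)  # header only, no rows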
