
Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and the other would include the rest of the rows.

I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the csv file using pandas.

What would be the most efficient way to do this?

Thanks for your help!

Let's assume you've got a Dataset called "df". You can:

Option one: write it twice:

df.write.option("header", "false").csv(...)           # data only, no header
df.limit(1).write.option("header", "true").csv(...)   # header; limit(1) keeps it a DataFrame (take(1) would return a list of Rows)
# as far as I remember, someone had problems with saving a DataFrame without any rows ->
# you must write at least one row and then manually cut that row out using the normal Java or Python file API

Or, option two: write once with header = true, then manually cut the header out and place it in a new file using the normal Java (or Python) file API - a sketch of this in Python follows below.
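A minimal Python sketch of that second option, assuming a single output part file; the paths and the use of coalesce(1) are assumptions, not part of the original answer:

import glob

# write once, with the header included; coalesce(1) keeps everything in one part file (assumed here)
df.coalesce(1).write.option("header", "true").csv("/tmp/full_output")

# locate the single part file Spark produced inside the output directory
part_file = glob.glob("/tmp/full_output/part-*.csv")[0]

# split it into a header-only file and a data-only file using the normal Python file API
with open(part_file) as src, \
        open("/tmp/header.csv", "w") as header_out, \
        open("/tmp/data.csv", "w") as data_out:
    header_out.write(next(src))   # the first line is the header
    for line in src:              # the remaining lines are data rows
        data_out.write(line)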

Data, without header:

df.to_csv("filename.csv", header=False)  # write the data rows only, suppressing the header line

Header, without data:

import pandas as pd

df_new = pd.DataFrame(data=None, columns=df.columns)  # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("filename.csv")  # writes only the header line (use a different file name so the data file is not overwritten)
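If the starting point is a Spark DataFrame rather than a pandas one, a small sketch of the bridge; spark_df and the file names are hypothetical, and this only works if the data fits in driver memory:

import pandas as pd

# collect the Spark DataFrame to the driver as a pandas DataFrame (only viable for data that fits in memory)
pdf = spark_df.toPandas()

pdf.to_csv("data.csv", header=False, index=False)                    # data rows only, no header
pd.DataFrame(columns=pdf.columns).to_csv("header.csv", index=False)  # header only, no rows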
