
How to get 1000 records from dataframe and write into a file using PySpark?

I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records into each file. Can anyone help me solve this? Thanks in advance.

You can use the maxRecordsPerFile option while writing the dataframe.

  • If you need the whole dataframe to write 1000 records into each file, use repartition(1); or, to write 1000 records per partition, use .coalesce(1).

Example:

# 1000 records written per file in each partition
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# 1000 records written per file for the whole dataframe; 100 files created for 100,000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# or set the config on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
#or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()

df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
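
To check the split, one rough sketch (assuming the same <path> placeholder used above) is to read the output back and count records per source file:

# sketch: count records per output file after writing (assumes <path> from above)
from pyspark.sql.functions import input_file_name, count

spark.read.parquet(<path>).\
    groupBy(input_file_name().alias("file")).\
    agg(count("*").alias("records")).\
    show(truncate=False)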

Method-2:

Calculate the number of partitions, then repartition the dataframe:

df = spark.range(10000)

# calculate the number of partitions (repartition expects an int)
no_partitions = int(df.count() / 1000)

from pyspark.sql.functions import *

#repartition and check number of records on each partition
df.repartition(no_partitions).\
withColumn("partition_id",spark_partition_id()).\
groupBy(col("partition_id")).\
agg(count("*")).\
show()

#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+

df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
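
If the row count is not an exact multiple of 1000, integer division leaves a remainder; a minimal sketch that rounds the partition count up so the last partial chunk still gets its own file:

# sketch: round up so a trailing partial chunk gets its own partition/file
import math

no_partitions = max(1, math.ceil(df.count() / 1000))
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)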

First, create a row number column:

from pyspark.sql import functions as F, Window
df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))

Now, run a loop and keep saving the records.

# row_number() starts at 1, so take rows i+1 .. i+1000 in each pass
for i in range(0, df.count(), 1000):
    records = df.where(F.col("row_num").between(i + 1, i + 1000))
    records.toPandas().to_csv("file-{}.csv".format(i))
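
Note that toPandas() collects each chunk onto the driver; if that becomes a memory concern, here is a sketch of the same loop writing through Spark's own CSV writer instead (the output directory names are just examples):

# sketch: write each 1000-row chunk with Spark's CSV writer instead of pandas
for i in range(0, df.count(), 1000):
    chunk = df.where(F.col("row_num").between(i + 1, i + 1000))
    chunk.coalesce(1).write.mode("overwrite").csv("chunk-{}".format(i), header=True)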
