
How to get 1000 records from dataframe and write into a file using PySpark?

I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records into each file. Can anyone help me solve this? Thanks in advance.

You can use the maxRecordsPerFile option while writing the dataframe.

  • If you need the whole dataframe to write 1000 records into each file, use repartition(1); or, to write 1000 records per partition, use .coalesce(1).

Example:

# 1000 records written per file in each partition
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# 1000 records written per file for the whole dataframe; 100 files created for 100,000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# or set the config on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
#or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()

df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
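
To check the split, one rough sketch (assuming the same <path> placeholder used above) is to read the output back and count records per source file:

# sketch: count records per output file after writing (assumes <path> from above)
from pyspark.sql.functions import input_file_name, count

spark.read.parquet(<path>).\
    groupBy(input_file_name().alias("file")).\
    agg(count("*").alias("records")).\
    show(truncate=False)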

Method-2:

Calculate the number of partitions, then repartition the dataframe:

df = spark.range(10000)

# calculate the number of partitions (repartition expects an int)
no_partitions = int(df.count() / 1000)

from pyspark.sql.functions import *

#repartition and check number of records on each partition
df.repartition(no_partitions).\
withColumn("partition_id",spark_partition_id()).\
groupBy(col("partition_id")).\
agg(count("*")).\
show()

#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+

df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
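
If the row count is not an exact multiple of 1000, integer division leaves a remainder; a minimal sketch that rounds the partition count up so the last partial chunk still gets its own file:

# sketch: round up so a trailing partial chunk gets its own partition/file
import math

no_partitions = max(1, math.ceil(df.count() / 1000))
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)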

First, create a row number column:

from pyspark.sql import functions as F, Window
df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))

Now, run a loop and keep saving the records.

# row_number() starts at 1, so take rows i+1 .. i+1000 in each pass
for i in range(0, df.count(), 1000):
    records = df.where(F.col("row_num").between(i + 1, i + 1000))
    records.toPandas().to_csv("file-{}.csv".format(i))
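
Note that toPandas() collects each chunk onto the driver; if that becomes a memory concern, here is a sketch of the same loop writing through Spark's own CSV writer instead (the output directory names are just examples):

# sketch: write each 1000-row chunk with Spark's CSV writer instead of pandas
for i in range(0, df.count(), 1000):
    chunk = df.where(F.col("row_num").between(i + 1, i + 1000))
    chunk.coalesce(1).write.mode("overwrite").csv("chunk-{}".format(i), header=True)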
