Repartitioning a pyspark dataframe fails and how to avoid the initial partition size
I'm trying to tune the performance of Spark by partitioning a Spark DataFrame. Here is the code:
from pyspark.sql import functions as func

file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
    .where(func.col("organization") == organization)
df = df.repartition(10)
# execute an action just to make Spark execute the repartition step
df.first()
During the execution of first() I check the job stages in the Spark UI, and here is what I find:

[Spark UI screenshot of the job stages]

Why is there no repartition step in the stage? And why is there an extra stage, when I only requested one action, first()? Is it because of the shuffle caused by the repartition?

When I initially read the df, you can see that it is partitioned over 43k partitions, which is really a lot (compared to its size when I save it to a csv file: 4 MB with 13k rows) and creates problems in further steps; that's why I wanted to repartition it.

Should I use cache() after the repartition, i.e. df = df.repartition(10).cache()? When I executed df.first() a second time, I still got a scheduled stage with 43k partitions, even though df.rdd.getNumPartitions() returned 10.

EDIT: the number of partitions is just a trial value; my questions are meant to help me understand how to do the repartitioning correctly.

Note: initially the DataFrame is read from a selection of parquet files in Hadoop.
I have already read this as a reference: How does Spark partition(ing) work on files in HDFS?
Use coalesce instead of repartition. I think it causes less shuffling, since it only reduces the number of partitions.