
Repartitioning a pyspark dataframe fails and how to avoid the initial partition size

I'm trying to tune the performance of Spark by using partitioning on a Spark dataframe. Here is the code:

file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
    .where((func.col("organization") == organization)) 
df = df.repartition(10)
# execute an action just to make Spark execute the repartition step
df.first()

During the execution of first() I check the job stages in the Spark UI, and here is what I find: (screenshots: job details, details of stage 7)

  • Why is there no repartition step in the stage?
  • Why is there also a stage 8? I only requested one action, first(). Is it because of the shuffle caused by the repartition?
  • Is there a way to change the partitioning of the parquet files without incurring such operations? Initially, when I read the df, you can see that it's partitioned over 43k partitions, which is really a lot (compared to its size when I save it to a csv file: 4 MB with 13k rows), and this creates problems in further steps; that's why I wanted to repartition it.
  • Should I use cache() after repartition, i.e. df = df.repartition(10).cache()? When I executed df.first() a second time, I also got a scheduled stage with 43k partitions, despite df.rdd.getNumPartitions() returning 10. EDIT: the number of partitions is just for experimenting; my questions are meant to help me understand how to do the repartition correctly.

Note: initially the Dataframe is read from a selection of parquet files in Hadoop.

I already read this as a reference: How does Spark partition(ing) work on files in HDFS?

  • Whenever there is shuffling, there is a new stage, and the repartition causes shuffling; that's why you have two stages.
  • Caching is used when you'll use the dataframe multiple times, to avoid reading it twice.

Use coalesce instead of repartition. I think it causes less shuffling, since it only reduces the number of partitions.
