
Why so many tasks in my Spark job? Getting 200 tasks by default

I have a Spark job that takes a file with 8 records from HDFS, does a simple aggregation and saves it back to HDFS. I notice there are something like hundreds of tasks when I do this.

I am also not sure why there are multiple jobs for this. I thought a job was created when an action happened. I can speculate as to why, but my understanding was that this code should be one job, broken down into stages, not multiple jobs. Why doesn't it just break down into stages; why does it break into multiple jobs?

As far as the 200-plus tasks go, since the amount of data and the number of nodes is minuscule, it doesn't make sense that there are roughly 25 tasks for each row of data when there is only one aggregation and a couple of filters. Why wouldn't it just have one task per partition per atomic operation?

Here is the relevant Scala code:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object TestProj {
  def main(args: Array[String]) {

    /* set the application name in the SparkConf object */
    val appConf = new SparkConf().setAppName("Test Proj")

    /* env settings that I don't need to set in REPL*/
    val sc = new SparkContext(appConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd1 = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")

     /*the below rdd will have schema defined in Record class*/
     val rddCase =  sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")
      .map(x=>x.split(" "))    // split each record into an array of strings on spaces
      .map(x=>Record(
        x(0).toInt,
        x(1).asInstanceOf[String],
        x(2).asInstanceOf[String],
        x(3).toInt))


    /* the below dataframe groups on first letter of first name and counts it*/
    val aggDF = rddCase.toDF()
      .groupBy($"firstName".substr(1,1).alias("firstLetter"))
      .count
      .orderBy($"firstLetter")

    /* save to hdfs */
    aggDF.write.format("parquet").mode("append").save("/raw/miscellaneous/ex_out_agg")

  }

    case class Record(id: Int
      , firstName: String
      , lastName: String
      , quantity:Int)

}

Below is the screenshot after clicking on the application:

[screenshot]

Below are the stages shown when viewing the specific "job" with id 0:

[screenshot]

Below is the first part of the screen when clicking on the stage with over 200 tasks:

[screenshot]

This is the second part of the screen inside the stage:

[screenshot]

Below is the screen after clicking on the "Executors" tab:

[screenshot]

As requested, here are the stages for Job ID 1:

[screenshot]

Here are the details for the stage in Job ID 1 with 200 tasks:

[screenshot]

This is a classic Spark question.

The two tasks used for reading (Stage Id 0 in the second figure) come from the defaultMinPartitions setting, which is set to 2. You can get this parameter by reading the value of sc.defaultMinPartitions in the REPL. It should also be visible in the Spark UI under the "Environment" tab.

You can take a look at the code on GitHub to see that this is exactly what is happening. If you want more partitions to be used on read, just pass it as a parameter, e.g. sc.textFile("a.txt", 20).
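
As a quick sanity check, here is a minimal sketch, assuming the same sc and HDFS path from the question (the 20 is just an illustrative value):

// minimal sketch, assuming the `sc` and input path from the question
println(sc.defaultMinPartitions)      // typically 2
val rdd20 = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt", 20)
println(rdd20.partitions.length)      // at least 20, since the argument is a minimum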

Now, the interesting part comes from the 200 partitions of the second stage (Stage Id 1 in the second figure). Each time there is a shuffle, Spark needs to decide how many partitions the shuffle RDD will have. As you can imagine, the default is 200.

You can change that using:

sqlContext.setConf("spark.sql.shuffle.partitions", "4")

If you run your code with this configuration, you will see that the 200 partitions are no longer there. How to set this parameter is kind of an art. Maybe choose 2x the number of cores you have (or whatever works for your workload).
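
For example, a minimal sketch, assuming the sc, sqlContext, and aggDF from the question, that ties the setting to the available cores could look like this:

// minimal sketch, assuming the `sc`, `sqlContext`, and `aggDF` from the question
sqlContext.setConf("spark.sql.shuffle.partitions", (sc.defaultParallelism * 2).toString)
println(aggDF.rdd.partitions.length)  // should now reflect the new setting rather than 200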

I think Spark 2.0 has a way to automatically infer the best number of partitions for shuffle RDDs. Looking forward to that!

Finally, the number of jobs you get has to do with how many RDD actions the resulting optimized DataFrame code produced. If you read the Spark documentation, it says that each RDD action will trigger one job. When your action involves a DataFrame or Spark SQL, the Catalyst optimizer will figure out an execution plan and generate some RDD-based code to execute it. It's hard to say exactly why it uses two actions in your case. You may need to look at the optimized query plan to see exactly what it is doing.
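
A minimal way to do that, assuming the aggDF from the question, is the DataFrame explain method:

// minimal sketch, assuming the `aggDF` from the question
aggDF.explain(true)   // prints the parsed, analyzed, optimized, and physical plans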

I am having a similar problem. But in my scenario, the collection I am parallelizing has fewer elements than the number of tasks scheduled by Spark (causing Spark to behave oddly sometimes). Using a forced partition number, I was able to fix this issue.

It was something like this:

collection = range(10) # In the real scenario it was a complex collection
sc.parallelize(collection).map(lambda e: e + 1) # also a more complex operation in the real scenario

Then, I saw in the Spark log:

INFO YarnClusterScheduler: Adding task set 0.0 with 512 tasks
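
In Scala, the same fix is a sketch like the following, where the second argument to parallelize forces the partition count (the numbers are illustrative):

// illustrative sketch: force a small, explicit partition count for a tiny collection
val small = sc.parallelize(1 to 10, 4)        // 4 partitions instead of the scheduler default
println(small.map(_ + 1).partitions.length)   // 4, so only 4 tasks run for this stage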
