
Why is my Spark job getting stuck when collect() is called?

I am new to Spark. I have created a Spark job which does some data processing per user. What I am trying to do is get all the files in a directory and process those files. There are multiple directories, and they belong to multiple users.

After reading the files within one user's directory, I do some transformations, after which I need to work on the results collectively (for example, removing some duplicates based on the data). To do this I call collect() on the RDD.

When running this with 10 directories it works fine, but when running with 1000 directories it gets stuck at the collect() call.

I have only done local testing.

Initializing Spark:

private lazy val sparkSession = SparkSession
.builder()
.appName("Custom Job")
.master("local[*]")
.getOrCreate()

Reading directories and parallelizing:

val allDirs: Seq[String] = fs.getAllDirInPath(Configuration.inputDir)
val paths: RDD[String] = SessionWrapper.getSparkContext.parallelize(allDirs)

Transformations and the collect() call:

paths.foreachPartition { partition =>
  partition.foreach { dir =>
    val dirData = readDataByDir(dir) // RDD[String]
    val transformed = doTranform(dirData) // RDD[CustomObject]
    val collectedData = transformed.collect()
    // Do something on collected data
    writeToFile(collectedData)
  }
}

Some logs from the stuck console:

20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO FileInputFormat: Total input paths to process : 3
20/09/09 19:24:40 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 65935
20/09/09 19:24:40 INFO DAGScheduler: Got job 2 (collect at MyCustomHelperWithCollectCall.scala:18) with 2 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 2 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[102] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 7.2 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 3.5 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.31.222:55666 (size: 3.5 KiB, free: 2004.3 MiB)
20/09/09 19:24:40 INFO SparkContext: Created broadcast 14 from broadcast at DAGScheduler.scala:1200
20/09/09 19:24:40 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[102] at map at MyCustomHelperWithCollectCall.scala:18) (first 15 tasks are for partitions Vector(0, 1))
20/09/09 19:24:40 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
20/09/09 19:24:40 INFO DAGScheduler: Got job 3 (collect at MyCustomHelperWithCollectCall.scala:18) with 1 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 3 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[96] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 7.2 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 3.5 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.31.222:55666 (size: 3.5 KiB, free: 2004.3 MiB)
20/09/09 19:24:40 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1200
20/09/09 19:24:40 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[96] at map at MyCustomHelperWithCollectCall.scala:18) (first 15 tasks are for partitions Vector(0))
20/09/09 19:24:40 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
20/09/09 19:24:40 INFO DAGScheduler: Got job 4 (collect at MyCustomHelperWithCollectCall.scala:18) with 1 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 4 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[101] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO FileInputFormat: Total input paths to process : 5

Please help!

collect (action): returns all the elements of the dataset as an array to the driver program. This is usually only useful after a filter or other operation that returns a sufficiently small subset of the data.

It seems like you have too little memory. Don't call collect; instead use

df.show(100)

or

df.take(100)
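The difference matters because take(n) ships only the first n elements to the driver, while collect() materializes the entire dataset in driver memory. A minimal local sketch of the two calls (assuming a local[*] session; the object and value names here are illustrative, not from the original code):

```scala
import org.apache.spark.sql.SparkSession

object TakeVsCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TakeVsCollect")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // take(100) returns only 100 elements to the driver,
    // regardless of how large the RDD is.
    val sample = rdd.take(100)
    println(sample.length)

    // collect() would pull all 1,000,000 elements into driver memory;
    // with 1000 directories' worth of data this is what exhausts the heap.
    // val all = rdd.collect()

    spark.stop()
  }
}
```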

Also, add your Spark DAG graph to the question to help us understand the processing.
