Apache Spark mapPartitions strange behavior (lazy evaluation?)

Question

I am trying to log the execution time of each mapPartitions operation on an RDD using code like this (in Scala):

rdd.mapPartitions{partition =>
   val startTime = Calendar.getInstance().getTimeInMillis
   val result = partition.map{element =>
      [...]
   }
   val endTime = Calendar.getInstance().getTimeInMillis
   logger.info("Partition time "+(endTime-startTime)+ "ms")
   result
}

The problem is that it logs the "partition time" immediately, before the map operation starts executing, so I always get a time of about 2 ms.

I noticed this while watching the Spark Web UI: in the log file, the line with the execution time appears immediately after the task starts, not at the end as expected.

Can someone explain why? The code inside mapPartitions should be executed sequentially, or am I wrong?

Thanks

Regards, Luca

Answer 1

The partition inside mapPartitions is an Iterator[Row], and an Iterator is evaluated lazily in Scala (i.e., only when the Iterator is consumed). This has nothing to do with Spark's lazy evaluation!

Calling partition.size will trigger the evaluation of your mapping, but it will also consume the Iterator (because it can only be traversed once). An example:

val it = Iterator(1,2,3)
it.size // 3
it.isEmpty // true
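To see the ordering issue from the question in isolation, here is a minimal sketch in plain Scala, without Spark. The object name LazyIteratorDemo and the events buffer are made up for illustration; the point is only that the function passed to map does not run until the Iterator is consumed.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyIteratorDemo {
  // Records the order in which things happen and returns it.
  def run(): Vector[String] = {
    val events = ArrayBuffer.empty[String]

    // map on an Iterator only wraps it; the function is not called yet
    val mapped = Iterator(1, 2, 3).map { x =>
      events += s"mapped $x"
      x * 2
    }
    events += "after map call"

    // Consuming the iterator is what actually runs the mapping function
    mapped.toVector
    events += "after toVector"

    events.toVector
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The "after map call" event is recorded before any "mapped" event, which is the same reason the log line in the question appears right after the task starts.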

What you can do is convert the Iterator to a non-lazy collection type:

rdd.mapPartitions{partition =>
   val startTime = Calendar.getInstance().getTimeInMillis
   val result = partition.map{element =>
      [...]
   }.toVector // now the statements are evaluated
   val endTime = Calendar.getInstance().getTimeInMillis
   logger.info("Partition time "+(endTime-startTime)+ "ms")
   result.iterator
}
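Note that toVector holds the whole mapped partition in memory at once. If that is a concern, one possible alternative (not part of the original answer, just a sketch) is to keep the Iterator lazy and report the time when it runs out of elements. The helper name timed and its onDone callback are hypothetical, and the measured time also includes whatever the downstream consumer does between elements.

```scala
object TimedIteratorDemo {
  /** Wraps an iterator and calls onDone(elapsedMillis) once, when it is exhausted. */
  def timed[A](underlying: Iterator[A])(onDone: Long => Unit): Iterator[A] =
    new Iterator[A] {
      private val start = System.nanoTime()
      private var reported = false

      override def hasNext: Boolean = {
        val more = underlying.hasNext
        // Report only the first time we see the end of the underlying iterator
        if (!more && !reported) {
          reported = true
          onDone((System.nanoTime() - start) / 1000000L)
        }
        more
      }

      override def next(): A = underlying.next()
    }

  def main(args: Array[String]): Unit = {
    val it = timed(Iterator(1, 2, 3).map(_ * 2)) { ms =>
      println(s"Partition time $ms ms")
    }
    println(it.toVector)
  }
}
```

Inside mapPartitions this would presumably be used by returning timed(partition.map{...}) { ms => logger.info(...) } instead of the materialized collection.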

EDIT: Note that you can use System.currentTimeMillis() (or even System.nanoTime()) instead of Calendar.
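As a small sketch of the System.nanoTime() variant: nanoTime is intended for measuring elapsed intervals, so the difference between two readings can be converted to milliseconds. The helper name timeMillis is made up for this example.

```scala
object ElapsedDemo {
  /** Runs body and returns its result together with the elapsed time in milliseconds. */
  def timeMillis[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body // body is a by-name parameter, so it is evaluated here
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // The block being timed must force evaluation (here via toVector)
    val (doubled, ms) = timeMillis(Iterator(1, 2, 3).map(_ * 2).toVector)
    println(s"$doubled computed in $ms ms")
  }
}
```

As with the examples above, the timed block has to consume the Iterator; otherwise only the creation of the lazy wrapper is measured.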
