Apache Spark mapPartitions strange behavior (lazy evaluation?)
I am trying to log the execution time of each mapPartitions operation on an RDD using code like this (in Scala):
import java.util.Calendar

rdd.mapPartitions { partition =>
  val startTime = Calendar.getInstance().getTimeInMillis
  val result = partition.map { element =>
    [...]
  }
  val endTime = Calendar.getInstance().getTimeInMillis
  logger.info("Partition time " + (endTime - startTime) + " ms")
  result
}
The problem is that it logs the "partition time" immediately, before it starts to execute the map operation, so I always get a time of about 2 ms.
I noticed this while watching the Spark Web UI: in the log file, the line with the execution time appears immediately after the task starts, not at the end as expected.
Can someone explain why? The code inside mapPartitions should be executed linearly, or am I wrong?
Thanks
Regards, Luca
partition inside of mapPartitions is an Iterator[Row], and an Iterator is evaluated lazily in Scala (i.e. when the Iterator is consumed). partition.map therefore only builds a new Iterator; the mapping function runs when elements are pulled from it, which happens after your block has already logged the time and returned. This has nothing to do with Spark's lazy evaluation!
Calling size on the mapped Iterator (result.size in your code) will trigger the evaluation of your mapping, but it will also consume the Iterator (because it can only be iterated once).
An example:
val it = Iterator(1,2,3)
it.size // 3
it.isEmpty // true
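The same deferral applies to map, which is what the code in the question runs into. Here is a minimal sketch in plain Scala (no Spark needed) that records the order of events; the object name LazyMapDemo and the log messages are made up for illustration:

```scala
import scala.collection.mutable.ArrayBuffer

object LazyMapDemo {
  // Returns the recorded events in the order they actually happened.
  def events(): Vector[String] = {
    val log = ArrayBuffer.empty[String]
    val mapped = Iterator(1, 2, 3).map { x =>
      log += s"mapped $x" // runs only when an element is pulled
      x * 2
    }
    log += "after map call" // recorded first: nothing has been mapped yet
    val total = mapped.sum  // consuming the iterator runs the mapping function
    log += s"sum = $total"
    log.toVector
  }
}
```

The "after map call" entry comes before any "mapped" entry, just as the "Partition time" log line comes before the real work in the question.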
What you can do is convert the Iterator to a non-lazy collection type:
rdd.mapPartitions { partition =>
  val startTime = Calendar.getInstance().getTimeInMillis
  val result = partition.map { element =>
    [...]
  }.toVector // forces evaluation; the whole mapped partition is now held in memory
  val endTime = Calendar.getInstance().getTimeInMillis
  logger.info("Partition time " + (endTime - startTime) + " ms")
  result.iterator // mapPartitions must return an Iterator
}
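Note that toVector keeps every mapped element of the partition in memory at once. If that is a concern, one possible alternative (not part of the original answer) is to keep the Iterator lazy and report the elapsed time when it runs out of elements. The sketch below assumes a hypothetical helper called TimedIterator with a hypothetical callback onDone; neither is a Spark or Scala library API:

```scala
object TimedIterator {
  /** Wraps `underlying` and calls `onDone(elapsedNanos)` once, when it is exhausted. */
  def apply[A](underlying: Iterator[A])(onDone: Long => Unit): Iterator[A] =
    new Iterator[A] {
      private val start = System.nanoTime() // taken when the wrapper is created
      private var reported = false

      def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !reported) {
          reported = true
          onDone(System.nanoTime() - start) // fires on the first hasNext that returns false
        }
        more
      }

      def next(): A = underlying.next()
    }
}

// Possible use inside mapPartitions (rdd, logger and f come from your own code):
//   rdd.mapPartitions { partition =>
//     TimedIterator(partition.map(f)) { nanos =>
//       logger.info("Partition time " + nanos / 1000000 + " ms")
//     }
//   }
```

Because elements are pulled one at a time by whatever consumes the Iterator, the time measured this way also includes the work done by that consumer between pulls, so it is not the cost of the mapping function alone.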
EDIT: Note that you can use System.currentTimeMillis() (or even System.nanoTime()) instead of using Calendar.