Enums as keys in Spark PairRDDs causing problems
Some operations on Spark PairRDDs don't work correctly when the RDD's key is, or contains, an enum.
For example, the following piece of Spark code takes two weeks' worth of weekdays and counts them by weekday:
import java.time.DayOfWeek
val weekdays: Seq[(DayOfWeek, Int)] = DayOfWeek.values().map(dow => (dow, 1))
val numPartitions = 2 * weekdays.size
val result = sc
.parallelize(weekdays ++ weekdays, numPartitions)
.reduceByKey(_ + _)
.collect
.toSeq
println(result)
In the output, I'd expect every weekday (e.g., MONDAY) to have count 2. However, on my cluster, I get:
WrappedArray(
(THURSDAY,1), (SATURDAY,1), (WEDNESDAY,2), (SATURDAY,1),
(MONDAY,2), (TUESDAY,2), (THURSDAY,1), (FRIDAY,2), (SUNDAY,2)
)
If you run this on a cluster with a single node (or set numPartitions to 1), the result is correct (i.e., all counts are 2).
Spark PairRDD operations like aggregateByKey(), reduceByKey(), and combineByKey() take an optional argument to specify the Partitioner that Spark should use. If you don't specify a partitioner explicitly, Spark's HashPartitioner is used; it calls a row's key's hashCode() method and uses the result to assign the row to a partition. However, the hashCode() of an enum is not guaranteed to be the same across different JVM processes, even if they run the same Java version. As a consequence, Spark xyzByKey() operations don't work correctly.
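The unstable hash can be observed directly: java.lang.Enum declares hashCode() as final and simply returns the identity hash, which is assigned per JVM process. A minimal, self-contained sketch (class name EnumHashDemo is mine, not from the original post):

```java
import java.time.DayOfWeek;

public class EnumHashDemo {
    public static void main(String[] args) {
        // java.lang.Enum.hashCode() is final and returns the identity
        // hash. That value is assigned per JVM process, so two executor
        // JVMs will generally see different hashes for the same constant.
        int hash = DayOfWeek.THURSDAY.hashCode();
        int identityHash = System.identityHashCode(DayOfWeek.THURSDAY);
        System.out.println(hash == identityHash); // prints "true"
    }
}
```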
In the above example, there are two pairs (THURSDAY, 1) in the input, and each gets processed on a different executor. The example uses a HashPartitioner with 14 (= numPartitions) partitions. Since THURSDAY.hashCode() % 14 produces different results on these two executors, the two rows get sent to different partitions to be reduced, resulting in an incorrect result.
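For reference, Spark's HashPartitioner assigns a partition by taking a non-negative modulo of the key's hashCode(). A sketch of that logic (method name partitionFor is mine, not Spark's API):

```java
public class PartitionSketch {
    // Sketch of the assignment HashPartitioner performs: a non-negative
    // modulo of the key's hashCode(). With an identity-based enum hash,
    // the same key can land in different partitions on different JVMs.
    static int partitionFor(Object key, int numPartitions) {
        int raw = key.hashCode() % numPartitions;
        return raw < 0 ? raw + numPartitions : raw;
    }

    public static void main(String[] args) {
        // A key with a value-based hashCode (String) is deterministic:
        // the same key maps to the same partition on every JVM.
        System.out.println(partitionFor("THURSDAY", 14));
    }
}
```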
Bottom line: don't use HashPartitioner with objects whose hash codes aren't consistent across JVM processes. In particular, the following objects aren't guaranteed to produce the same hash code on different JVM processes:
Java enums
sealed trait-based enums:
sealed trait TraitEnum
object TEA extends TraitEnum
object TEB extends TraitEnum
abstract class-based enums:
sealed abstract class AbstractClassEnum
object ACA extends AbstractClassEnum
object ACB extends AbstractClassEnum
(the latter two because the singleton objects fall back on the default, identity-based hashCode() implementation).
However, Scala case class-based enums have a consistent hash code and are thus safe to use:
sealed case class CaseClassEnum(…) // "…" must be a non-empty list of parameters
object CCA extends CaseClassEnum(…)
object CCB extends CaseClassEnum(…)
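A practical workaround along these lines (my sketch, not from the original post) is to key by the enum constant's name() instead of the enum itself: String.hashCode() is specified in its JavaDoc as a fixed polynomial over the characters, so it is identical on every JVM. In the Spark example, that means mapping (dow, 1) to (dow.name(), 1) before reduceByKey and converting back with DayOfWeek.valueOf afterwards.

```java
import java.time.DayOfWeek;

public class StableKeyDemo {
    public static void main(String[] args) {
        // String.hashCode() is specified as
        //   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        // so it is stable across JVM processes and safe to use with
        // HashPartitioner.
        String key = DayOfWeek.THURSDAY.name(); // "THURSDAY"
        int expected = 0;
        for (char c : key.toCharArray()) {
            expected = 31 * expected + c;
        }
        System.out.println(expected == key.hashCode()); // prints "true"
    }
}
```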
Additional info: