Enums as keys in Spark PairRDDs causing problems
Some operations on Spark PairRDDs don't work correctly when the RDD's key is, or contains, an enum.
For example, the following piece of Spark code takes two weeks' worth of weekdays and counts them by weekday:
import java.time.DayOfWeek
val weekdays: Seq[(DayOfWeek, Int)] = DayOfWeek.values().map(dow => (dow, 1))
val numPartitions = 2 * weekdays.size
val result = sc
.parallelize(weekdays ++ weekdays, numPartitions)
.reduceByKey(_ + _)
.collect
.toSeq
println(result)
In the output, I'd expect every weekday (e.g., MONDAY) to have count 2. However, on my cluster, I get:
WrappedArray(
(THURSDAY,1), (SATURDAY,1), (WEDNESDAY,2), (SATURDAY,1),
(MONDAY,2), (TUESDAY,2), (THURSDAY,1), (FRIDAY,2), (SUNDAY,2)
)
If you run this on a cluster with a single node (or set numPartitions to 1), the result is correct (i.e., all counts are 2).
Spark PairRDD operations like aggregateByKey(), reduceByKey(), and combineByKey() take an optional argument to specify the Partitioner that Spark should use. If you don't specify a partitioner explicitly, Spark's HashPartitioner is used; it calls a row's key's hashCode() method and uses the result to assign the row to a partition. However, the hashCode() of an enum is not guaranteed to be the same across different JVM processes, even if they run the same Java version. As a consequence, Spark xyzByKey() operations don't work correctly.
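The unstable hash can be observed directly: java.lang.Enum declares hashCode() as final and simply returns the identity hash, which is assigned per JVM process. A minimal, self-contained sketch (class name EnumHashDemo is mine, not from the original post):

```java
import java.time.DayOfWeek;

public class EnumHashDemo {
    public static void main(String[] args) {
        // java.lang.Enum.hashCode() is final and returns the identity
        // hash. That value is assigned per JVM process, so two executor
        // JVMs will generally see different hashes for the same constant.
        int hash = DayOfWeek.THURSDAY.hashCode();
        int identityHash = System.identityHashCode(DayOfWeek.THURSDAY);
        System.out.println(hash == identityHash); // prints "true"
    }
}
```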
In the above example, there are two pairs (THURSDAY, 1) in the input, and each gets processed on a different executor. The example uses a HashPartitioner with 14 (= numPartitions) partitions. Since THURSDAY.hashCode() % 14 produces different results on these two executors, the two rows get sent to different partitions to be reduced, resulting in an incorrect result.
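For reference, Spark's HashPartitioner assigns a partition by taking a non-negative modulo of the key's hashCode(). A sketch of that logic (method name partitionFor is mine, not Spark's API):

```java
public class PartitionSketch {
    // Sketch of the assignment HashPartitioner performs: a non-negative
    // modulo of the key's hashCode(). With an identity-based enum hash,
    // the same key can land in different partitions on different JVMs.
    static int partitionFor(Object key, int numPartitions) {
        int raw = key.hashCode() % numPartitions;
        return raw < 0 ? raw + numPartitions : raw;
    }

    public static void main(String[] args) {
        // A key with a value-based hashCode (String) is deterministic:
        // the same key maps to the same partition on every JVM.
        System.out.println(partitionFor("THURSDAY", 14));
    }
}
```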
Bottom line: don't use HashPartitioner with objects whose hash codes aren't consistent across JVM processes. In particular, the following objects aren't guaranteed to produce the same hash code on different JVM processes:
Java enums
sealed trait-based enums:
sealed trait TraitEnum
object TEA extends TraitEnum
object TEB extends TraitEnum
abstract class-based enums:
sealed abstract class AbstractClassEnum
object ACA extends AbstractClassEnum
object ACB extends AbstractClassEnum
(the latter two because the singleton objects fall back on the default, identity-based hashCode() implementation).
However, Scala case class-based enums have a consistent hash code and are thus safe to use:
sealed case class CaseClassEnum(…) // "…" must be a non-empty list of parameters
object CCA extends CaseClassEnum(…)
object CCB extends CaseClassEnum(…)
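A practical workaround along these lines (my sketch, not from the original post) is to key by the enum constant's name() instead of the enum itself: String.hashCode() is specified in its JavaDoc as a fixed polynomial over the characters, so it is identical on every JVM. In the Spark example, that means mapping (dow, 1) to (dow.name(), 1) before reduceByKey and converting back with DayOfWeek.valueOf afterwards.

```java
import java.time.DayOfWeek;

public class StableKeyDemo {
    public static void main(String[] args) {
        // String.hashCode() is specified as
        //   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        // so it is stable across JVM processes and safe to use with
        // HashPartitioner.
        String key = DayOfWeek.THURSDAY.name(); // "THURSDAY"
        int expected = 0;
        for (char c : key.toCharArray()) {
            expected = 31 * expected + c;
        }
        System.out.println(expected == key.hashCode()); // prints "true"
    }
}
```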
Additional info: