

Why is Spark performing worse when using Kryo serialization?

I enabled Kryo serialization for my Spark job, enabled the setting to require registration, and ensured all my types were registered.

val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Fail fast if any class crossing a shuffle boundary has not been registered
conf.set("spark.kryo.registrationRequired", "true")
// `classes` and `avroSchemas` are built elsewhere in the job setup
conf.registerKryoClasses(classes)
conf.registerAvroSchemas(avroSchemas: _*)
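
For context, classes here is just an Array[Class[_]] listing the types that get shuffled. A trimmed-down sketch of what it contains (the real list in the job is longer) looks something like:

// Hypothetical registration list; the real job registers every type that
// crosses a shuffle boundary, including the Avro records A and B shown below.
val classes: Array[Class[_]] = Array(
  classOf[A],
  classOf[B],
  classOf[Array[A]],
  classOf[Array[B]]
)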

Wall-clock performance of the job worsened by about 20% and the number of bytes shuffled increased by almost 400%.

This seems really surprising to me, given the Spark documentation's suggestion that Kryo should be better:

Kryo is significantly faster and more compact than Java serialization (often as much as 10x)

I manually invoked the serialize method on instances of Spark's org.apache.spark.serializer.KryoSerializer and org.apache.spark.serializer.JavaSerializer with an example of my data. The results were consistent with the suggestions in the Spark documentation: Kryo produced 98 bytes; Java produced 993 bytes. That really is a 10x improvement.
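
Roughly, the comparison looked like the following sketch (record construction is elided with ???; the exact byte counts obviously depend on the data):

import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer}

// `record` stands in for one real instance of my data; construction elided.
val record: A = ???

// Each serializer exposes newInstance().serialize(...), which returns a ByteBuffer.
val kryoBytes: ByteBuffer = new KryoSerializer(conf).newInstance().serialize(record)
val javaBytes: ByteBuffer = new JavaSerializer(conf).newInstance().serialize(record)

println(s"Kryo: ${kryoBytes.remaining()} bytes, Java: ${javaBytes.remaining()} bytes")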

A possibly confounding factor is that the objects being serialized and shuffled implement the Avro GenericRecord interface. I tried registering the Avro schemas in the SparkConf, but that showed no improvement.

I also tried shuffling the data with new classes that were plain Scala case classes, not including any of the Avro machinery (see the sketch below). It didn't improve the shuffle performance or the number of bytes exchanged.
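
For example, the Avro-free variant of B (shown further down) was essentially just a plain case class; PlainB is a hypothetical name:

// Plain Scala case class used for the Avro-free shuffle test: the same four
// fields as B, but no SpecificRecordBase parent and no SCHEMA$ object.
case class PlainB(f1: Long, f2: Long, f3: String, f4: String)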

The Spark code ends up boiling down to the following:

case class A(
    f1: Long,
    f2: Option[Long],
    f3: Int,
    f4: Int,
    f5: Option[String],
    f6: Option[Int],
    f7: Option[String],
    f8: Option[Int],
    f9: Option[Int],
    f10: Option[Int],
    f11: Option[Int],
    f12: String,
    f13: Option[Double],
    f14: Option[Int],
    f15: Option[Double],
    f16: Option[Double],
    f17: List[String],
    f18: String) extends org.apache.avro.specific.SpecificRecordBase {
  def get(f: Int) : AnyRef = ???
  def put(f: Int, value: Any) : Unit = ???
  def getSchema(): org.apache.avro.Schema = A.SCHEMA$
}
object A extends AnyRef with Serializable {
  val SCHEMA$: org.apache.avro.Schema = ???
}

case class B(
    f1: Long,
    f2: Long,
    f3: String,
    f4: String) extends org.apache.avro.specific.SpecificRecordBase {
  def get(field$ : Int) : AnyRef = ???
  def getSchema() : org.apache.avro.Schema = B.SCHEMA$
  def put(field$ : Int, value : Any) : Unit = ???
}
object B extends AnyRef with Serializable {
  val SCHEMA$ : org.apache.avro.Schema = ???
}

def join(as: RDD[A], bs: RDD[B]): RDD[(Iterable[A], Iterable[B])] = {
  val joined = as.map(a => a.f1 -> a) cogroup bs.map(b => b.f1 -> b)
  joined.map { case (_, asAndBs) => asAndBs }
}

Do you have any idea what might be going on or how I could get the better performance that should be available from Kryo?

If your individual records are very small and you have a huge number of them, that might make your job slow. Try increasing the buffer size and see whether it makes any improvement.

Try the configuration below if you haven't already:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Use a 24 MB Kryo buffer instead of the 0.064 MB default
  // (on newer Spark versions the key is spark.kryoserializer.buffer)
  .set("spark.kryoserializer.buffer.mb", "24")

Ref: https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/

Since you have high-cardinality RDDs, broadcasting / broadcast hash joins would unfortunately seem to be off limits.

Your best bet is to coalesce() your RDDs prior to joining. Are you seeing high skew in your shuffle times? If so, you may want to coalesce with shuffle=true, as sketched below.
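
Something along these lines, for example (the partition count is just a placeholder to tune for your data and cluster):

// Repartition both sides before the cogroup; shuffle = true redistributes
// records evenly across the requested number of partitions.
val numPartitions = 200  // placeholder value; tune for your data volume
val asByKey = as.map(a => a.f1 -> a).coalesce(numPartitions, shuffle = true)
val bsByKey = bs.map(b => b.f1 -> b).coalesce(numPartitions, shuffle = true)
val joined = asByKey cogroup bsByKey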

Lastly, if you have RDDs of nested structures (e.g. JSON), that will sometimes allow you to bypass shuffles. Check out the slides and/or video here for a more detailed explanation.
