
Fastest serialization/deserialization of Scala case classes

If I've got a nested object graph of case classes, similar to the example below, and I want to store collections of them in a Redis list, what libraries or tools should I look at that will give the fastest overall round trip to Redis?

This will include:

  • time to serialize the item
  • network cost of transferring the serialized data
  • network cost of retrieving the stored serialized data
  • time to deserialize back into case classes

    case class Person(name: String, age: Int, children: List[Person]) {}

UPDATE (2018): scala/pickling is no longer actively maintained. Hordes of other libraries have arisen as alternatives. They take similar approaches but tend to focus on specific serialization formats, e.g., JSON, binary, protobuf.

Your use case is exactly the targeted use case for scala/pickling ( https://github.com/scala/pickling ). Disclaimer: I'm an author.

Scala/pickling was designed to be a faster, more typesafe, and more open alternative to automatic frameworks like Java serialization or Kryo. It was built in particular for distributed applications, so serialization/deserialization time and serialized data size take a front seat. It takes a different approach to serialization altogether: it generates pickling (serialization) code inline at the use site at compile time, so it's really very fast.

The latest benchmarks are in our OOPSLA paper. For the binary pickle format (you can also choose others, like JSON), scala/pickling is consistently faster than Java serialization and Kryo, and produces binary representations that are on par with or smaller than Kryo's, meaning less latency when passing your pickled data over the network.

For more info, there's a project page: http://lampwww.epfl.ch/~hmiller/pickling

There is also a ScalaDays 2013 talk from June on Parleys.

We'll also be presenting some new developments at Strange Loop 2013, in particular related to sending closures over the network, in case that might also be a pain point for your use case.

As of the time of this writing, scala/pickling is in pre-release, with our first stable release planned for August 21st.

Update:

Be careful when using the serialization methods from the JDK. The performance is not great, and one small change in your class can make previously serialized data impossible to deserialize.
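One common mitigation for the "small change breaks deserialization" problem is to pin the class's serialVersionUID, so that the JVM does not derive a new one from the class structure after every edit. The following is a minimal sketch, assuming the Person class from the question; the value 1L and the object name SerialVersionCheck are arbitrary choices for illustration. Pinning the UID only helps with changes that Java serialization treats as compatible (such as adding a field), so it is not a general schema evolution mechanism.

```scala
import java.io.ObjectStreamClass

// Pin the UID explicitly; without this annotation the JVM computes one
// from the class structure, so edits to the class can change it.
@SerialVersionUID(1L)
case class Person(name: String, age: Int, children: List[Person])

object SerialVersionCheck {
  // The UID that ObjectInputStream compares with the one recorded in the stream.
  val uid: Long = ObjectStreamClass.lookup(classOf[Person]).getSerialVersionUID
}
```

If the UID recorded in the stored bytes differs from the UID of the class on the classpath, deserialization fails with an InvalidClassException.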


I've used scala/pickling, but it takes a global lock while serializing/deserializing.

So instead of using it, I wrote my own serialization/deserialization code like this:

import java.io._

object Serializer {

  // Serializes any Serializable value using JDK object serialization.
  def serialize[T <: Serializable](obj: T): Array[Byte] = {
    val byteOut = new ByteArrayOutputStream()
    val objOut = new ObjectOutputStream(byteOut)
    try objOut.writeObject(obj)
    finally objOut.close() // flushes and also closes the underlying byteOut
    byteOut.toByteArray
  }

  // Reads a value back. The cast to T is unchecked, so the caller must
  // request the type that was actually written.
  def deserialize[T <: Serializable](bytes: Array[Byte]): T = {
    val objIn = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try objIn.readObject().asInstanceOf[T]
    finally objIn.close() // also closes the underlying byte stream
  }
}

Here is an example of using it:

case class Example(a: String, b: String)

val obj = Example("a", "b")
val bytes = Serializer.serialize(obj)
val obj2 = Serializer.deserialize[Example](bytes)

According to the uPickle benchmarks, "uPickle runs 30-50% faster than Circe for reads/writes, and ~200% faster than play-json" for serializing case classes.

It's easy to use. Here's how to serialize a case class to a JSON string:

case class City(name: String, funActivity: String, latitude: Double)
val bengaluru = City("Bengaluru", "South Indian food", 12.97)
implicit val cityRW = upickle.default.macroRW[City]
upickle.default.write(bengaluru) // "{\"name\":\"Bengaluru\",\"funActivity\":\"South Indian food\",\"latitude\":12.97}"

You can also serialize to binary or other formats.
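As a sketch of the full round trip, the snippet below reads the JSON back into a case class and repeats the exercise with uPickle's binary (MessagePack) output. It assumes the uPickle artifact is on the classpath; the scala-cli directive and the version number in it are assumptions, and the object name UpickleRoundTrip is made up for illustration.

```scala
//> using dep "com.lihaoyi::upickle:3.1.0"
// The directive above is only read by scala-cli; the version is an assumption.
import upickle.default.{ReadWriter, macroRW, read, readBinary, write, writeBinary}

case class City(name: String, funActivity: String, latitude: Double)

object City {
  // Placing the ReadWriter in the companion object makes it available implicitly.
  implicit val rw: ReadWriter[City] = macroRW
}

object UpickleRoundTrip {
  val bengaluru: City = City("Bengaluru", "South Indian food", 12.97)

  // JSON round trip
  val json: String = write(bengaluru)
  val fromJson: City = read[City](json)

  // Binary round trip (uPickle emits MessagePack bytes here)
  val bytes: Array[Byte] = writeBinary(bengaluru)
  val fromBinary: City = readBinary[City](bytes)
}
```

For a Redis list, the byte array from writeBinary is what you would push, which avoids an extra string encoding step.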

The accepted answer from 2013 proposes a library that is no longer maintained. There are many similar questions on StackOverflow, but I really couldn't find a good answer that meets the following criteria:

  • serialization/deserialization should be fast
  • high-performance data exchange over the wire, where you only encode as much metadata as you need
  • support for schema evolution, so that changing the serialized object (ex: case class) doesn't break deserialization of previously stored data

I recommend against using low-level JDK SerDes (like ByteArrayOutputStream and ByteArrayInputStream). Supporting schema evolution becomes a pain, and it's difficult to make it work with external services (ex: Thrift), since you have no control over whether the data sent back was written with the same type of streams.

Some people use JSON, via libraries like json4s, but it is not well suited to message transfer in distributed computing. It marshals data as a JSON string, so it is both slower and less storage-efficient: every character in the string takes at least 8 bits.
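To make the size argument concrete, here is a minimal sketch using only the JDK. It compares the bytes needed for a number written as text (as it would appear inside a JSON document) with a fixed-width binary encoding. The object name SizeComparison and the sample value are made up for illustration, and the comparison ignores JSON's field names and punctuation, which add further overhead.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object SizeComparison {
  val value: Long = 1234567890123L

  // Text encoding: one UTF-8 byte per decimal digit.
  val asText: Array[Byte] = value.toString.getBytes(UTF_8)

  // Fixed-width binary encoding: a Long always occupies 8 bytes.
  val asBinary: Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(value).array()

  // Decoding the binary form is a single read, with no parsing of digits.
  val decoded: Long = ByteBuffer.wrap(asBinary).getLong
}
```

Here the text form takes 13 bytes and the binary form 8. For very small numbers the text form can be shorter than a fixed 8 bytes, which is why binary formats such as MessagePack use variable-length integer encodings.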

I highly recommend using the MessagePack binary serialization format, and I would recommend reading the spec to understand the encoding specifics. It has implementations in many different languages. Here's a generic example I wrote for a Scala case class that you can copy-paste into your code.

import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit

import org.msgpack.core.MessagePack

case class Data(message: String, number: Long, timeUnit: TimeUnit, price: Long)

object Data extends App {

  def serialize(data: Data): ByteBuffer = {
    val packer = MessagePack.newDefaultBufferPacker
    packer
      .packString(data.message)
      .packLong(data.number)
      .packString(data.timeUnit.toString)
      .packLong(data.price)
    packer.close()
    ByteBuffer.wrap(packer.toByteArray)
  }

  def deserialize(data: ByteBuffer): Data = {
    val unpacker = MessagePack.newDefaultUnpacker(convertDataToByteArray(data))
    val newdata = Data.apply(
      message = unpacker.unpackString(),
      number = unpacker.unpackLong(),
      timeUnit = TimeUnit.valueOf(unpacker.unpackString()),
      price = unpacker.unpackLong()
    )
    unpacker.close()
    newdata
  }

  def convertDataToByteArray(data: ByteBuffer): Array[Byte] = {
    val buffer = Array.ofDim[Byte](data.remaining())
    data.duplicate().get(buffer)
    buffer
  }

  println(deserialize(serialize(Data("Hello world!", 1L, TimeUnit.DAYS, 3L))))
}

It will print:

Data(Hello world!,1,DAYS,3)
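The example above packs fields positionally, so the reader and writer must agree on the exact field order. One way to leave room for schema evolution is to pack the fields as a MessagePack map keyed by field name and have the reader skip keys it does not recognize. The following is a sketch under that assumption, not part of the example above: the names Quote and QuoteCodec are hypothetical, and the scala-cli directive with its version number is an assumption.

```scala
//> using dep "org.msgpack:msgpack-core:0.9.0"
// The directive above is only read by scala-cli; the version is an assumption.
import org.msgpack.core.MessagePack

case class Quote(message: String, price: Long)

object QuoteCodec {

  // Writes the fields as a map of name -> value instead of a positional sequence.
  def serialize(q: Quote): Array[Byte] = {
    val packer = MessagePack.newDefaultBufferPacker
    packer.packMapHeader(2)
    packer.packString("message").packString(q.message)
    packer.packString("price").packLong(q.price)
    packer.close()
    packer.toByteArray
  }

  // Reads known keys and skips unknown ones, so data written by a newer
  // version with extra fields can still be decoded.
  def deserialize(bytes: Array[Byte]): Quote = {
    val unpacker = MessagePack.newDefaultUnpacker(bytes)
    var message = "" // default used when the key is absent
    var price = 0L   // default used when the key is absent
    val entries = unpacker.unpackMapHeader()
    for (_ <- 0 until entries) {
      unpacker.unpackString() match {
        case "message" => message = unpacker.unpackString()
        case "price"   => price = unpacker.unpackLong()
        case _         => unpacker.skipValue()
      }
    }
    unpacker.close()
    Quote(message, price)
  }
}
```

The trade-off is size: field names travel with every message, so this costs more bytes than the positional layout in exchange for tolerance to added or reordered fields.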
