
Spark: Dataframe Serialization

I have two questions about Spark serialization that I could not find answers to by googling.

  1. How can I print the name of the serializer currently in use? I want to know whether spark.serializer is Java or Kryo.
  2. I have the following code, which is supposed to use Kryo serialization. The memory used for the dataframe becomes 21 MB, a quarter of what it was when I cached without serialization. But when I remove the Kryo configuration, the size stays at 21 MB. Does this mean Kryo was never used in the first place? Could it be that, because the records in the dataframe are simply rows, Java and Kryo serialization produce the same size?

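     import org.apache.spark.SparkConf
     import org.apache.spark.sql.SparkSession
     import org.apache.spark.storage.StorageLevel

     // Ask for Kryo, without requiring classes to be registered up front
     val conf = new SparkConf()
     conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     conf.set("spark.kryo.registrationRequired", "false")

     val spark = SparkSession.builder
       .master("local[*]")
       .config(conf)
       .appName("KryoWithRegistrationNOTRequired")
       .getOrCreate()

     // Cache the dataframe in serialized form
     val df = spark.read.csv("09-MajesticMillion.csv")
     df.persist(StorageLevel.MEMORY_ONLY_SER)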

Does this mean Kryo was never used in the first place?

Yes, exactly that. Spark SQL (Dataset) uses its own columnar storage for caching. No Java or Kryo serialization is involved, so spark.serializer has no impact at all.
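As for the first question, a minimal sketch of one way to inspect the serializer and to see the difference between Dataset and RDD caching is given below. It assumes the CSV file from the question is present in the working directory. The `Site` case class, the `SerializerCheck` object, and the generated `domain…example` values are hypothetical names introduced only for illustration. `SparkEnv` and `getRDDStorageInfo` are developer-level APIs, so treat this as a diagnostic aid rather than the recommended way to monitor an application.

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical record type, used only for the RDD comparison below.
case class Site(rank: Int, domain: String)

object SerializerCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("SerializerCheck")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    val sc = spark.sparkContext

    // Configured value; when the key is unset, fall back to Spark's
    // default, the Java serializer.
    val configured = sc.getConf.get(
      "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    println(s"spark.serializer (configured): $configured")

    // Class of the serializer instance the running environment holds.
    println(s"Serializer in use: ${SparkEnv.get.serializer.getClass.getName}")

    // Dataset cache: stored in Spark SQL's columnar format,
    // independent of spark.serializer.
    val df = spark.read.csv("09-MajesticMillion.csv")
    df.persist(StorageLevel.MEMORY_ONLY_SER)
    df.count() // an action is needed to materialize the cache

    // RDD cache of plain objects: serialized with spark.serializer.
    val sites = sc.parallelize(1 to 100000)
      .map(i => Site(i, s"domain$i.example"))
      .setName("sites (RDD)")
    sites.persist(StorageLevel.MEMORY_ONLY_SER)
    sites.count()

    // Report the in-memory size of everything currently cached.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory")
    }

    spark.stop()
  }
}
```

Running this once with the Kryo setting and once without it should leave the size reported for the cached dataframe unchanged, while the size of the cached RDD is the one expected to differ. The same figures appear in the Storage tab of the Spark UI.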
