
How to save Map to Json in Spark using scala?

I need to save a Map (key-value pairs) in one column using Spark. The requirement is that other people may use the data with other tools like PIG, so it is better to save the Map in a general format rather than as a specially formatted string. I create the column using this code:

StructField("cMap", DataTypes.createMapType(StringType, StringType), true) ::

Then, after creating the dataframe, I get this schema:

|-- cMap: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

Then I save the dataframe to JSON:

df.write.json(path)

I found that the JSON output is:

"cMap":{"1":"a","2":"b","3":"c"}

So when I read it back from the file next time:

val new_df = sqlContext.read.json(path)

I get this schema:

|-- cMap: struct (nullable = true)
|    |-- 1: string
|    |-- 2: string
|    |-- 3: string

Is there an efficient way to save and read the map as JSON without extra processing? (I could encode the map into a specially formatted string and decode it on read, but I think it should not need to be that complex.) Thanks.

You can save the table as a Parquet file:

  • Write:

    df.write.parquet("mydf.parquet")

  • Read (a quick schema check follows these steps):

    val new_df = spark.read.parquet("mydf.parquet")
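
As a quick check, reading the Parquet file back should report the map type unchanged (output sketched from the schema in the question):

    new_df.printSchema()
    // root
    //  |-- cMap: map (nullable = true)
    //  |    |-- key: string
    //  |    |-- value: string (valueContainsNull = true)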

See the Spark SQL guide on save modes and Parquet files:

    // Encoders for most common types are automatically provided by importing spark.implicits._
    import spark.implicits._

    val peopleDF = spark.read.json("examples/src/main/resources/people.json")

    // DataFrames can be saved as Parquet files, maintaining the schema information
    peopleDF.write.parquet("people.parquet")

    // Read in the parquet file created above
    // Parquet files are self-describing so the schema is preserved
    // The result of loading a Parquet file is also a DataFrame
    val parquetFileDF = spark.read.parquet("people.parquet")

    // Parquet files can also be used to create a temporary view and then used in SQL statements
    parquetFileDF.createOrReplaceTempView("parquetFile")
    val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
    namesDF.map(attributes => "Name: " + attributes(0)).show()

The Parquet format should solve the issue you are having. Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression. Because Parquet files are self-describing, the full schema, including the map type, is preserved on read.

Just save it to Parquet as below:

    import org.apache.spark.sql.SaveMode
    df.write.mode(SaveMode.Overwrite).parquet("path to the output")

And read it back as below:

val new_df = sqlContext.read.parquet("path to the above output")
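
As an illustrative follow-up (the key "1" comes from the sample data in the question), individual entries of the map column can then be looked up directly:

    // getItem works on MapType columns; for the sample data this yields "a"
    new_df.select(new_df("cMap").getItem("1").alias("value_for_1")).show()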

I hope this helps.
