
How to format a JSON string as a MongoDB-style document using Spark/Scala?

I have a table with two columns, id and json_string, and I need to convert json_string into a MongoDB document format. I'm sending the data from Spark/Scala to MongoDB.

I tried using withColumn, but I still don't get the desired format. This is what I have so far; any help would be appreciated.

Original JSON string sample (df):

val df = spark.sql("select id, json_string from mytable")

{"id":"0001","json_string":"{\"header\": {\"column1\":\"value1\",\"column2\":\"value2\"},\"tail\": [{\"column3\":\"value3\",\"column4\":\"value4\",\"column5\":\"value5\"}]}"}

Using withColumn (df2), I get this, where header and tail are still JSON strings rather than nested objects:

val df2 = df.withColumn("json_string", from_json(col("json_string"), MapType(StringType, StringType)))

{"id":"0001","json_string":{"header":"{\"column1\":\"value1\",\"column2\":\"value2\"}","tail":"[{\"column3\":\"value3\",\"column4\":\"value4\",\"column5\":\"value5\"}]"}}

Desired format:

{"id":{"$id":"0001"},"header":{"column1":"value1","column2":"value2"},"tail":[{"column3":"value3","column4":"value4","column5":"value5"}]}


Instead of defining the schema manually, you can infer it dynamically from the data and pass it to from_json:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._ // needed for .as[String]

// Infer the schema by reading the JSON strings themselves
val json_schema = spark.read.json(df.select("json_string").as[String]).schema
val df2 = df.withColumn("json_string", from_json(col("json_string"), json_schema))
  .select("id", "json_string.*")

Result:

+----+----------------+--------------------------+
|id  |header          |tail                      |
+----+----------------+--------------------------+
|0001|{value1, value2}|[{value3, value4, value5}]|
+----+----------------+--------------------------+

Schema:

root
 |-- id: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- column1: string (nullable = true)
 |    |-- column2: string (nullable = true)
 |-- tail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- column3: string (nullable = true)
 |    |    |-- column4: string (nullable = true)
 |    |    |-- column5: string (nullable = true)
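
The result above still leaves id as a plain string, whereas the desired format nests it as {"$id": "0001"}. A minimal sketch of one way to get there follows, assuming the JSON always has the header/tail shape from the sample so that an explicit schema can stand in for the inferred one. The names MongoShape, toMongoShape and parsed are made up for illustration. Whether MongoDB and the Spark connector accept a field named $id depends on their versions and has not been checked here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, struct}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object MongoShape {
  // Explicit schema mirroring the sample json_string (an assumption:
  // real data may contain more or different fields).
  val jsonSchema: StructType = StructType(Seq(
    StructField("header", StructType(Seq(
      StructField("column1", StringType),
      StructField("column2", StringType)))),
    StructField("tail", ArrayType(StructType(Seq(
      StructField("column3", StringType),
      StructField("column4", StringType),
      StructField("column5", StringType)))))
  ))

  // Parse json_string into nested columns and wrap id in a struct
  // whose single field is named "$id".
  def toMongoShape(df: DataFrame): DataFrame =
    df.withColumn("parsed", from_json(col("json_string"), jsonSchema))
      .select(
        struct(col("id").as("$id")).as("id"),
        col("parsed.header").as("header"),
        col("parsed.tail").as("tail"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mongo-shape")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.sql("select id, json_string from mytable")
    val df = Seq(
      ("0001",
        """{"header": {"column1":"value1","column2":"value2"},"tail": [{"column3":"value3","column4":"value4","column5":"value5"}]}""")
    ).toDF("id", "json_string")

    // Print each row as a JSON document to compare with the desired format
    toMongoShape(df).toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

Using MapType(StringType, StringType), as in the question, tells Spark that every top-level value is a string, which is why header and tail came back as escaped JSON text. A StructType schema, whether written by hand as here or inferred with spark.read.json, keeps them as nested structures.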
