
Spark from_json - StructType and ArrayType

I have a data set that comes in as XML, and one of the nodes contains JSON. Spark is reading this in as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.

I am able to convert a string of JSON, but how do I write the schema to work with an Array?

String without Array - Working nicely

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._   // StructType, StringType, ArrayType
import spark.implicits._              // enables the $"..." column syntax

val schemaExample = new StructType()
  .add("FirstName", StringType)
  .add("Surname", StringType)

val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")

val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))

dfICanWorkWith.collect()

// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
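
A minimal follow-up sketch, assuming the dfExample above: aliasing the parsed struct makes its nested fields selectable as ordinary columns.

// Alias the parsed column, then promote the nested fields to top-level columns
val dfFlat = dfExample
  .select(from_json($"theJson", schemaExample).alias("parsed"))
  .select($"parsed.FirstName", $"parsed.Surname")

dfFlat.show()
// +---------+-------+
// |FirstName|Surname|
// +---------+-------+
// |   Johnny|    Boy|
// +---------+-------+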

String with an Array - Can't figure this one out

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schemaExample2 = new StructType()
  .add("", ArrayType(new StructType()
    .add("FirstName", StringType)
    .add("Surname", StringType)))

val dfExample2 = spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")

val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))

dfICanWorkWith.collect()

// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])

The problem is that you don't have well-formed JSON that matches your schema. Your JSON is missing a couple of things:

  • First, you are missing the surrounding {} that wraps the JSON object
  • Second, you are missing the key for the array (your schema declares a field named "", but the JSON never supplies it)
  • Lastly, you are missing the closing ]

Try replacing it with:

val dfExample2 = spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")

and you will get:

scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
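
A related sketch, assuming you control the schema: giving the wrapper field a real name (say "people", a hypothetical name) instead of the empty string makes it easy to explode the parsed array into one row per record.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Same schema as above, but with a named wrapper field ("people" is hypothetical)
val schemaNamed = new StructType()
  .add("people", ArrayType(new StructType()
    .add("FirstName", StringType)
    .add("Surname", StringType)))

val dfNamed = spark.sql("""select "{\"people\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")

// explode() yields one row per array element; the struct fields can then
// be selected as ordinary columns
dfNamed
  .select(from_json($"theJson", schemaNamed).alias("parsed"))
  .select(explode($"parsed.people").alias("person"))
  .select($"person.FirstName", $"person.Surname")
  .show()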

As of Spark 2.4, the schema_of_json function helps:

> SELECT schema_of_json('[{"col":0}]');
  array<struct<col:int>>

In your case, you can then use the code below to parse that array of JSON objects:

scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson                       |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
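
For completeness, a minimal sketch of the same parse through the Scala DataFrame API (Spark 2.4+, assuming a SparkSession named spark is in scope): schema_of_json can infer the schema from a sample record, and from_json accepts that schema Column directly.

import org.apache.spark.sql.functions._
import spark.implicits._

val sample = """[{ "FirstName":"Johnny", "Surname":"Boy" }, { "FirstName":"Franky", "Surname":"Man" }]"""
val df = Seq(sample).toDF("theJson")

// Spark 2.4+ lets from_json take the schema as a Column inferred at runtime
df.select(from_json($"theJson", schema_of_json(sample)).alias("theJson")).show(false)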
