Parsing JSON within a Spark DataFrame into new columns
I have a dataframe that looks like this:
---------------------------------------------------------------------
|name |meals                                                        |
---------------------------------------------------------------------
|Tom  |{"breakfast": "banana", "lunch": "sandwich"}                 |
|Alex |{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}|
|Lisa |{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}   |
---------------------------------------------------------------------
Obtained from the following:
// assumes import spark.implicits._ for toDF
val rawDf = Seq(
  ("Tom", """{"breakfast": "banana", "lunch": "sandwich"}"""),
  ("Alex", """{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}"""),
  ("Lisa", """{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}""")
).toDF("name", "meals")
I want to transform it into a dataframe that looks like this:
----------------------------
|name |meal      |food     |
----------------------------
|Tom  |breakfast |banana   |
|Tom  |lunch     |sandwich |
|Alex |breakfast |yogurt   |
|Alex |lunch     |pizza    |
|Alex |dinner    |pasta    |
|Lisa |lunch     |sushi    |
|Lisa |dinner    |lasagna  |
|Lisa |snack     |apple    |
----------------------------
I'm using Spark 2.1, so I'm parsing the JSON using get_json_object. Currently, I'm trying to get the final dataframe using an intermediary dataframe that looks like this:
--------------------------------------------
|name |breakfast |lunch    |dinner  |snack |
--------------------------------------------
|Tom  |banana    |sandwich |null    |null  |
|Alex |yogurt    |pizza    |pasta   |null  |
|Lisa |null      |sushi    |lasagna |apple |
--------------------------------------------
Obtained from the following:
val intermediaryDF = rawDf.select(col("name"),
  get_json_object(col("meals"), "$." + Meals.breakfast).alias(Meals.breakfast),
  get_json_object(col("meals"), "$." + Meals.lunch).alias(Meals.lunch),
  get_json_object(col("meals"), "$." + Meals.dinner).alias(Meals.dinner),
  get_json_object(col("meals"), "$." + Meals.snack).alias(Meals.snack))
Meals is defined in another file that has a lot more entries than breakfast, lunch, dinner, and snack, but it looks something like this:
object Meals {
  val breakfast = "breakfast"
  val lunch = "lunch"
  val dinner = "dinner"
  val snack = "snack"
}
I then use intermediaryDF to compute the final DataFrame, like so:
val finalDF = intermediaryDF.where(col("breakfast").isNotNull)
    .select(col("name"), lit(Meals.breakfast).as("meal"), col("breakfast").as("food")).union(
  intermediaryDF.where(col("lunch").isNotNull)
    .select(col("name"), lit(Meals.lunch).as("meal"), col("lunch").as("food"))).union(
  intermediaryDF.where(col("dinner").isNotNull)
    .select(col("name"), lit(Meals.dinner).as("meal"), col("dinner").as("food"))).union(
  intermediaryDF.where(col("snack").isNotNull)
    .select(col("name"), lit(Meals.snack).as("meal"), col("snack").as("food")))
Using the intermediary DataFrame works if I only have a few types of Meals, but I actually have 40, and enumerating every one of them to compute intermediaryDF is impractical. I also don't like the idea of having to compute this DF in the first place. Is there a way to get directly from my raw dataframe to the final dataframe without the intermediary step, and also without explicitly having a case for every value in Meals?
Apache Spark provides support for parsing JSON data, but it needs a predefined schema in order to parse it correctly. Your JSON data is dynamic, so you cannot rely on a fixed schema.
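For illustration, here is a sketch of what the schema-bound route looks like (using from_json, available since Spark 2.1; mealsSchema is a name made up for this example). It only covers the four meals from the question, which is exactly the problem: with 40 meal types you would have to declare all 40 fields up front.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// every possible meal key has to be declared ahead of time
val mealsSchema = StructType(Seq(
  StructField("breakfast", StringType),
  StructField("lunch", StringType),
  StructField("dinner", StringType),
  StructField("snack", StringType)
))

val schemaParsed = rawDf.select(col("name"), from_json(col("meals"), mealsSchema).as("m"))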
One way is to not let Apache Spark parse the data at all, but to parse it yourself in a key-value fashion, e.g. into something like a Map[String, String], which is pretty generic.
Here is what you can do instead:
Use the Jackson JSON mapper for Scala:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
// note: in older jackson-module-scala releases, ScalaObjectMapper lives
// under the com.fasterxml.jackson.module.scala.experimental package
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

// mapper object created on each executor node
val mapper = new ObjectMapper with ScalaObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.registerModule(DefaultScalaModule)

val valueAsMap = mapper.readValue[Map[String, String]]("""{"breakfast": "banana", "lunch": "sandwich"}""")
This transforms the JSON string into a Map[String, String]. That can also be viewed as a list of (key, value) pairs:
List((breakfast,banana), (lunch,sandwich))
Now the Apache Spark part comes into play. Define a user defined function (UDF) that parses the string and outputs the list of (key, value) pairs:
import org.apache.spark.sql.functions.udf

val jsonToArray = udf((json: String) => {
  mapper.readValue[Map[String, String]](json).toList
})
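One caveat worth noting: the UDF's closure is shipped to the executors, so the mapper it references must survive serialization. A common pattern (a sketch, one of several options; JsonParser and jsonToArraySafe are names invented here) is to keep the mapper as a @transient lazy val inside a serializable object, so each executor rebuilds its own instance on first use:

object JsonParser extends Serializable {
  // @transient keeps the mapper out of the serialized closure;
  // lazy val means each executor builds it on first use
  @transient lazy val mapper = {
    val m = new ObjectMapper with ScalaObjectMapper
    m.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    m.registerModule(DefaultScalaModule)
    m
  }
}

val jsonToArraySafe = udf((json: String) =>
  JsonParser.mapper.readValue[Map[String, String]](json).toList)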
Apply that transformation on the "meals" column and it will be transformed into a column of type Array. After that, explode that column and select the key entry as column meal and the value entry as column food:
import org.apache.spark.sql.functions.{col, explode}

val df1 = rawDf.select(col("name"), explode(jsonToArray(col("meals"))).as("meals"))
val df2 = df1.select(col("name"), col("meals._1").as("meal"), col("meals._2").as("food"))
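The two selects can also be collapsed into a single chain; finalDF is just a name chosen here to mirror the question:

val finalDF = rawDf
  .select(col("name"), explode(jsonToArray(col("meals"))).as("meals"))
  .select(col("name"), col("meals._1").as("meal"), col("meals._2").as("food"))

finalDF.show()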
Showing the last dataframe outputs:
+----+---------+--------+
|name|     meal|    food|
+----+---------+--------+
| Tom|breakfast|  banana|
| Tom|    lunch|sandwich|
|Alex|breakfast|  yogurt|
|Alex|    lunch|   pizza|
|Alex|   dinner|   pasta|
|Lisa|    lunch|   sushi|
|Lisa|   dinner| lasagna|
|Lisa|    snack|   apple|
+----+---------+--------+