
How to map each row of a Scala dataframe to a new schema

I am processing streaming events of different types and different schemas in Spark with Scala, and I need to parse them and save them in a format that's easy to process further in a generic way.

I have a dataframe of events that looks like this:

val df = Seq(
  ("{\"a\": 1, \"b\": 2, \"c\": 3 }", "One", "001"),
  ("{\"a\": 6, \"b\": 2, \"d\": 2, \"f\": 8 }", "Two", "089"),
  ("{\"a\": 3, \"b\": 4, \"c\": 6 }", "One", "123")
).toDF("col1", "col2", "col3")

which renders as:

+------------------------------------+--------+------+
|   body                             |   type |   id |
+------------------------------------+--------+------+
|{"a": 1, "b": 2, "c": 3 }           |   "One"|   001|
|{"a": 6, "d": 2, "f": 8, "g": 10}   |   "Two"|   089|
|{"a": 3, "b": 4, "c": 6 }           | "Three"|   123|
+------------------------------------+--------+------+

and I would like to turn it into the one below. We can assume that all events of type "One" will have the same schema, and that all event types share some common data, such as the entry "a", which I would like to surface into its own column:

+---+--------------------------------+--------+------+
| a |  data                          |   y    |   z  |
+---+--------------------------------+--------+------+
| 1 |{"b": 2, "c": 3 }               |   "One"|   001|
| 6 |{"d": 2, "f": 8, "g": 10}       |   "Two"|   089|
| 3 |{"b": 4, "c": 6 }               | "Three"|   123|
+---+--------------------------------+--------+------+

One way to achieve that is to handle the JSON data as a Map, as shown below:

import org.apache.spark.sql.types.{MapType, StringType, IntegerType}
import org.apache.spark.sql.functions.{from_json, expr}

val df = Seq(
  ("{\"a\": 1, \"b\": 2, \"c\": 3 }", "One", "001") ,
  ("{\"a\": 6, \"b\": 2, \"d\": 2, \"f\": 8 }", "Two", "089"), 
  ("{\"a\": 3, \"b\": 4, \"c\": 6 }", "One", "123")
).toDF("body", "type", "id")

val mapSchema = MapType(StringType, IntegerType)

df.withColumn("map", from_json($"body", mapSchema))
  .withColumn("data_keys", expr("filter(map_keys(map), k -> k != 'a')"))
  .withColumn("data_values", expr("transform(data_keys, k -> element_at(map,k))"))
  .withColumn("data", expr("to_json(map_from_arrays(data_keys, data_values))"))
  .withColumn("a", $"map".getItem("a"))
  .select($"a", $"data", $"type".as("y"), $"id".as("z"))
  .show(false)

// +---+-------------------+---+---+
// |a  |data               |y  |z  |
// +---+-------------------+---+---+
// |1  |{"b":2,"c":3}      |One|001|
// |6  |{"b":2,"d":2,"f":8}|Two|089|
// |3  |{"b":4,"c":6}      |One|123|
// +---+-------------------+---+---+

Analysis

  1. withColumn("map", from_json($"body", mapSchema)): first generate a Map from the given JSON data.
  2. withColumn("data_keys", expr("filter(map_keys(map), k -> k != 'a')")): retrieve the keys of the new map, keeping only the keys not equal to 'a'. The filter function used here returns an array, i.e. {"a": 1, "b": 2, "c": 3 } -> [b, c].
  3. withColumn("data_values", expr("transform(data_keys, k -> element_at(map, k))")): populate the values of the new map by looking up each of the previous keys with transform.
  4. withColumn("data", expr("to_json(map_from_arrays(data_keys, data_values))")): build the map from data_keys and data_values using map_from_arrays. Finally, call to_json to convert the map back to JSON.

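As a side note (my addition, not part of the original answer): on Spark 3.0+ the key/value juggling of steps 2-4 can be collapsed into a single map_filter call. A minimal self-contained sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{MapType, StringType, IntegerType}
import org.apache.spark.sql.functions.{from_json, expr}

val spark = SparkSession.builder.master("local[*]").appName("map-filter-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("{\"a\": 1, \"b\": 2, \"c\": 3 }", "One", "001"),
  ("{\"a\": 6, \"b\": 2, \"d\": 2, \"f\": 8 }", "Two", "089"),
  ("{\"a\": 3, \"b\": 4, \"c\": 6 }", "One", "123")
).toDF("body", "type", "id")

val mapSchema = MapType(StringType, IntegerType)

// map_filter (Spark 3.0+) drops the "a" entry in one step,
// replacing the data_keys / data_values / map_from_arrays chain
val result = df
  .withColumn("map", from_json($"body", mapSchema))
  .withColumn("data", expr("to_json(map_filter(map, (k, v) -> k != 'a'))"))
  .withColumn("a", $"map".getItem("a"))
  .select($"a", $"data", $"type".as("y"), $"id".as("z"))

result.show(false)
```

The output is the same as above; map_filter simply avoids materializing the intermediate key and value arrays.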
First you need to define the JSON schema as follows:

val schema = spark.read.json(df.select("col1").as[String]).schema

Then you can parse your column col1 as JSON (1st line) and select which elements of the JSON you want to extract (2nd line):

df.select(from_json($"col1", schema).as("data"), $"col2", $"col3")
.select($"data.a", $"data", $"col2", $"col3")

Output:

+---+-------------+----+----+
|  a|         data|col2|col3|
+---+-------------+----+----+
|  1|  [1, 2, 3,,]| One| 001|
|  6|[6, 2,, 2, 8]| Two| 089|
|  3|  [3, 4, 6,,]| One| 123|
+---+-------------+----+----+

I know it's not exactly the same as what you want, but it should give you a clue.

Another option, if you want to deconstruct your JSON completely, is to use data.*:

    df.select(from_json($"col1", schema).as("data"), $"col2", $"col3").select($"data.*", $"col2", $"col3")

+---+---+----+----+----+----+----+
|  a|  b|   c|   d|   f|col2|col3|
+---+---+----+----+----+----+----+
|  1|  2|   3|null|null| One| 001|
|  6|  2|null|   2|   8| Two| 089|
|  3|  4|   6|null|null| One| 123|
+---+---+----+----+----+----+----+
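As a side note (my addition, not from the answer): on Spark 3.1+ the struct-based approach can also produce the exact "data" JSON column the question asks for, by removing the a field with Column.dropFields and serializing the remainder back with to_json (which omits null fields by default). A sketch, under those version assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json}

val spark = SparkSession.builder.master("local[*]").appName("drop-fields-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("{\"a\": 1, \"b\": 2, \"c\": 3 }", "One", "001"),
  ("{\"a\": 6, \"b\": 2, \"d\": 2, \"f\": 8 }", "Two", "089"),
  ("{\"a\": 3, \"b\": 4, \"c\": 6 }", "One", "123")
).toDF("col1", "col2", "col3")

// infer the struct schema from the json strings, as in the answer above
val schema = spark.read.json(df.select("col1").as[String]).schema

val reshaped = df
  .select(from_json($"col1", schema).as("data"), $"col2", $"col3")
  .select(
    $"data.a".as("a"),
    // dropFields requires Spark 3.1+; to_json skips null struct fields by default
    to_json($"data".dropFields("a")).as("data"),
    $"col2".as("y"),
    $"col3".as("z"))

reshaped.show(false)
```

Note that with schema inference the numeric fields come back as longs, and fields absent from a given row stay null in the struct, so each row's serialized JSON keeps only the keys that row actually had.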
