Scala Spark - Split JSON column into multiple columns

Question

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a UDF that produces a JSON string column:

val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))

It outputs as follows:

+----------------+---------------------------------------+
| encrypted_data | decrypted_json                        |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"}        |
+----------------+---------------------------------------+

The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns, so that the output DataFrame looks like this:

+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+

Answer 1

The solution below is inspired by one of the solutions given by @Jacek Laskowski:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Needed for $"..." and toDF; already in scope in spark-shell, where `spark` is the session
import spark.implicits._

// Schema of each object inside the decrypted_json array
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)
// Schema of the whole raw JSON document
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

// Serialize the schema to its JSON representation
val schemaAsJson = schema.json

// The JSON representation can be turned back into a DataType;
// dt could be passed to from_json in place of schemaAsJson
val dt = DataType.fromJson(schemaAsJson)

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val flattened = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*") // <-- flatten the struct field
  .withColumn("decrypted", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json")  // <-- no longer needed
  .select("encrypted_data", "decrypted.*") // <-- flatten the struct field

Please see the linked original solution for the explanation.
I hope that helps.

Answer 2

Using from_json you can parse the JSON into a struct type and then select columns from the resulting DataFrame. You will need to know the schema of the JSON. Here is how:

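    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    // Create (or reuse) a Spark session; the local master and app name are placeholders
    val sparkSession = SparkSession.builder().master("local[*]").appName("json-split").getOrCreate()
    import sparkSession.implicits._

    val jsonData = """{"a":547.65 , "b":"Some Data"}"""
    // Schema of the JSON string: a is a number, b is a string
    val schema = StructType(
      List(
        StructField("a", DoubleType, nullable = false),
        StructField("b", StringType, nullable = false)
      ))

    val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
    // Parse the JSON string column into a struct column
    val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))

    // Pull the struct fields out as top-level columns
    dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show()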

Result

+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
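
Applied to the DataFrame from the question, the same approach might look like the sketch below. Because the decrypting UDF is external, the result DataFrame here is a hand-built stand-in with the same two columns; the local master, the app name, and the intermediate column name parsed are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("split-json").getOrCreate()
import spark.implicits._

// Stand-in for the output of the external UDF (hypothetical sample row)
val result: DataFrame = Seq(
  ("eyJleHAiOjE1", """{"a":547.65 , "b":"Some Data"}""")
).toDF("encrypted_data", "decrypted_json")

// Schema of the JSON held in decrypted_json
val jsonSchema = StructType(Seq(
  StructField("a", DoubleType),
  StructField("b", StringType)
))

// Parse the string into a struct, then promote its fields to top-level columns
val split = result
  .withColumn("parsed", from_json($"decrypted_json", jsonSchema))
  .select($"encrypted_data", $"parsed.a", $"parsed.b")

split.show(false)
```

Selecting only encrypted_data and the struct fields leaves out both the raw JSON string and the intermediate struct column, which matches the layout asked for in the question. Note that from_json yields null for rows whose JSON cannot be parsed against the schema, so malformed input shows up as null values in a and b rather than as an error.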

Answer 1: score 2, accepted, 2020-01-06 20:27:12

Answer 2: score 0, 2020-01-06 14:02:43
