Scala Spark - Split JSON column to multiple columns
Scala noob here, using Spark 2.3.0.
I'm creating a DataFrame using a UDF that produces a JSON string column:
val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))
It outputs the following:
+----------------+--------------------------------+
| encrypted_data | decrypted_json                 |
+----------------+--------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"} |
+----------------+--------------------------------+
The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns, so the output DataFrame looks like this:
+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
The solution below is inspired by one of the solutions given by @Jacek Laskowski:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for $-notation and toDF

// Schema of the JSON objects inside the decrypted_json array
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)

// Schema of the whole record
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

// The schema round-trips through its JSON representation
val schemaAsJson = schema.json
val dt = DataType.fromJson(schemaAsJson)

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*")                                    // <-- flatten the struct field
  .withColumn("address", explode($"decrypted_json"))   // <-- explode the array field
  .drop("decrypted_json")                              // <-- no longer needed
  .select("encrypted_data", "address.*")               // <-- flatten the struct field
Please go through the Link for the original solution with its explanation.
I hope that helps.
Using from_json you can parse the JSON into a struct type and then select columns from that DataFrame. You will need to know the schema of the JSON. Here is how:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val sparkSession = SparkSession.builder().getOrCreate() // create the Spark session
import sparkSession.implicits._

val jsonData = """{"a":547.65 , "b":"Some Data"}"""

// Schema of the JSON column
val schema = StructType(List(
  StructField("a", DoubleType, nullable = false),
  StructField("b", StringType, nullable = false)
))

val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))
dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show(false)
Result:
+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
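If the JSON schema is not known up front, one option (a sketch reusing sparkSession and df from above; note that it costs an extra pass over the data) is to let Spark infer the schema from the JSON strings themselves:

// Infer the schema from the JSON column (triggers an extra Spark job),
// then parse with from_json and flatten all fields at once
val inferredSchema = sparkSession.read.json(df.select($"json_column").as[String]).schema
val dfParsed = df.withColumn("json_data", from_json($"json_column", inferredSchema))
dfParsed.select($"string_column", $"json_data.*").show(false)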