Filtering on JSON data in a string column of a Spark dataframe

Question

I have a Spark dataframe in the format below, where the FamilyDetails column is a string field:

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- FamilyDetails: string (nullable = true)

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma      |Smith    |{                                                                                                                         |
|          |         | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"},                                   |
|          |         | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}                                            |
|          |         |}                                                                                                                         |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
|Liam      |Jones    |{                                                                                                                         |
|          |         |  "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"},                                 |
|          |         |  "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}                              |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I am trying to do:

I am trying to get the records that have inactive family members (IsActive='N'). Note that IsActive is an optional field.

Expected output:

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I have tried so far:

Since the complete schema is unknown, I tried to infer the schema from the FamilyDetails column itself.

import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)

which gives me:

StructType(
    StructField(23214598.31601190,
        StructType(
            StructField(gender,StringType,true), 
            StructField(Name,StringType,true), 
            StructField(relationship,StringType,true), 
            StructField(IsActive,StringType,true)
        )
        ,true
    )
)

How do I get rid of the first value (2321463.2321463) and keep only the required fields in the JSON schema? Or is there an easier way to filter the records where IsActive = 'N'?

Answer 1

Perhaps you can avoid parsing the JSON by simply searching for the string "IsActive":"N":

val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")
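One caveat with the pattern above is that it only matches when the JSON is serialized without whitespace around the colon. A slightly more tolerant pattern might look like the sketch below. The object name, the helper hasInactiveMember and the sample strings are made up for illustration. The check is done with java.util.regex, the regex engine that Spark's rlike uses, so it can be tried without a cluster.

```scala
import java.util.regex.Pattern

object IsActivePattern {
  // Hypothetical pattern: tolerates optional whitespace around the colon,
  // e.g. "IsActive":"N" as well as "IsActive" : "N".
  val inactivePattern: String = "\"IsActive\"\\s*:\\s*\"N\""

  // rlike matches anywhere in the string, so Matcher.find() mirrors
  // what the filter would decide for a single row.
  def hasInactiveMember(json: String): Boolean =
    Pattern.compile(inactivePattern).matcher(json).find()

  def main(args: Array[String]): Unit = {
    println(hasInactiveMember("""{"a":{"IsActive":"N"}}"""))
    println(hasInactiveMember("""{"a":{"IsActive" : "N"}}"""))
    println(hasInactiveMember("""{"a":{"IsActive":"Y"}}"""))
  }
}
```

In Spark the same pattern could be passed as df.filter(col("FamilyDetails").rlike(IsActivePattern.inactivePattern)). A regex still cannot tell a real IsActive field from the same text appearing inside another string value, which is one reason to prefer actual JSON parsing.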

For more rigorous parsing, you can use:

val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")
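The same filter can also be written with the typed DataFrame API instead of a SQL string. The following is a minimal sketch, assuming Spark 3.0 or later (where functions.exists is available in the Scala API) and a local SparkSession. The object name, the helper names and the compacted sample rows are illustrative and not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, exists, from_json, map_values}
import org.apache.spark.sql.types.{MapType, StringType}

object InactiveFamilyFilter {
  // Treat the JSON as map<memberId, map<fieldName, value>>, so the unknown
  // member ids never have to appear in a schema.
  private val familySchema = MapType(StringType, MapType(StringType, StringType))

  // Keeps rows where at least one family member has IsActive = "N".
  // A missing IsActive key makes the comparison null, which the filter drops.
  def inactiveFamilies(df: DataFrame): DataFrame = {
    val members = map_values(from_json(col("FamilyDetails"), familySchema))
    df.filter(exists(members, m => m.getItem("IsActive") === "N"))
  }

  // Sample rows abbreviated from the question (one JSON object per row).
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("Emma", "Smith",
        """{"23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"}}"""),
      ("Joe", "Williams",
        """{"2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}}"""),
      ("Liam", "Jones",
        """{"2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}}""")
    ).toDF("FirstName", "LastName", "FamilyDetails")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inactive-family-filter")
      .getOrCreate()
    inactiveFamilies(sampleData(spark)).show(truncate = false)
    spark.stop()
  }
}
```

Parsing into a map of maps sidesteps the schema-inference problem from the question: the member ids become map keys rather than struct field names, so only the inner field IsActive has to be known in advance.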

Answer 1 (2021-03-24 09:38:36)
