Filter based on JSON data which is in a string column in a Spark dataframe

I have a Spark dataframe in the format below, where the FamilyDetails column is a string field:

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- FamilyDetails: string (nullable = true)

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma      |Smith    |{                                                                                                                         |
|          |         | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"},                                   |
|          |         | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}                                            |
|          |         |}                                                                                                                         |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
|Liam      |Jones    |{                                                                                                                         |
|          |         |  "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"},                                 |
|          |         |  "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}                              |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I am trying to do:

I am trying to get the records that have at least one inactive family member ( IsActive='N' ). Note that IsActive is an optional field.

Expected output:

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I have tried so far:

Since the complete schema is unknown, I tried to infer the schema from the FamilyDetails column itself.

import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)

which gives me:

StructType(
    StructField(23214598.31601190,
        StructType(
            StructField(gender,StringType,true), 
            StructField(Name,StringType,true), 
            StructField(relationship,StringType,true), 
            StructField(IsActive,StringType,true)
        )
        ,true
    )
)

How do I get rid of the first value (2321463.2321463) and take only the required fields in the JSON schema? Or is there an easier approach to filter records where IsActive = 'N' ?

Perhaps you can avoid parsing the JSON by simply finding the string "IsActive":"N" :

val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")
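
One caveat with the substring match is that it only finds the exact character sequence "IsActive":"N", so a payload serialized with spaces around the colon would be missed. Below is a minimal sketch of a whitespace-tolerant variant, assuming a local Spark session. The object name RlikeInactiveSketch, the helper withInactiveMember, and the sample rows are made up for illustration and are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RlikeInactiveSketch {
  // Java regex that allows optional whitespace around the colon.
  // It is passed through the Column API, so no SQL string-literal escaping is involved.
  val inactivePattern: String = "\"IsActive\"\\s*:\\s*\"N\""

  // Hypothetical helper: keep rows whose raw JSON text contains an IsActive = N entry.
  // The JSON is never parsed; this is a plain text search.
  def withInactiveMember(df: DataFrame): DataFrame =
    df.filter(col("FamilyDetails").rlike(inactivePattern))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rlike-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative rows only; the second one has spaces around the colon.
    val df = Seq(
      ("Emma", "Smith", """{"1.1":{"gender":"F","Name":"Ms Emma Smith","relationship":null}}"""),
      ("Joe", "Williams", """{"2.2":{"Name":"Mrs Sophia Williams","IsActive" : "N"}}"""),
      ("Liam", "Jones", """{"3.3":{"Name":"Mr Liam Jones","IsActive":"Y"}}""")
    ).toDF("FirstName", "LastName", "FamilyDetails")

    withInactiveMember(df).show(truncate = false)
    spark.stop()
  }
}
```

Because this is still a text search, it would also match if the same character sequence happened to appear inside another string value, which is the main reason to prefer real parsing when correctness matters.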

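For more rigorous parsing, you can parse the column with from_json as a map of maps, which avoids needing a schema for the varying top-level keys, and keep the rows where any inner map has IsActive = 'N':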

val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")
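
The same filter can also be written with the typed DataFrame API instead of a SQL string. This is a sketch only, assuming Spark 3.0 or later (where exists is available in org.apache.spark.sql.functions) and a local session. The object name FromJsonInactiveSketch and the helper withInactiveMember are hypothetical, and the sample rows are the three records from the question in compact form.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, exists, from_json, map_values}
import org.apache.spark.sql.types.{MapType, StringType}

object FromJsonInactiveSketch {
  // Outer map: member id -> inner map of attribute name -> attribute value.
  // Using maps means the varying ids never have to appear in the schema.
  val familySchema: MapType = MapType(StringType, MapType(StringType, StringType))

  // Hypothetical helper: keep rows where at least one family member has IsActive = "N".
  def withInactiveMember(df: DataFrame): DataFrame = {
    val members = map_values(from_json(col("FamilyDetails"), familySchema))
    // Members without an IsActive key are expected to yield null here and so not match
    // (assumption: default, non-ANSI map lookup behaviour).
    df.filter(exists(members, member => member.getItem("IsActive") === "N"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("Emma", "Smith",
        """{"23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"},
          |"23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}}""".stripMargin),
      ("Joe", "Williams",
        """{"2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},
          |"2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}}""".stripMargin),
      ("Liam", "Jones",
        """{"2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"},
          |"2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}}""".stripMargin)
    ).toDF("FirstName", "LastName", "FamilyDetails")

    withInactiveMember(df).show(truncate = false)
    spark.stop()
  }
}
```

With the sample rows from the question, this should leave only the Joe Williams record. Rows whose FamilyDetails is null or does not parse as a JSON object should come back as null from from_json and be dropped by the filter.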

