Filter based on JSON data which is in a string column in a Spark dataframe
I have a Spark dataframe in the format below, where the FamilyDetails column is a string field:
root
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- FamilyDetails: string (nullable = true)
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma |Smith |{ |
| | | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"}, |
| | | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null} |
| | |} |
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
|Liam |Jones |{ |
| | | "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"}, |
| | | "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
What I am trying to do:
I am trying to get the records that have inactive family members (IsActive='N'). Note that IsActive is an optional field.
Expected output:
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
What I have tried so far:
Since the complete schema is unknown, I tried to infer the schema from the FamilyDetails column itself.
import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)
which gives me:
StructType(
StructField(23214598.31601190,
StructType(
StructField(gender,StringType,true),
StructField(Name,StringType,true),
StructField(relationship,StringType,true),
StructField(IsActive,StringType,true)
)
,true
)
)
How do I get rid of the first value (2321463.2321463) and take only the required fields in the JSON schema? Or is there an easier approach to filter records where IsActive = 'N'?
Perhaps you can avoid parsing the JSON altogether by simply searching for the string "IsActive":"N":
val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")
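Note that a plain substring match like this only works if the JSON is serialized without whitespace around the colon. If that cannot be guaranteed, a slightly more tolerant pattern may help. The following is only a sketch: the object name, the local SparkSession and the two sample rows are made up for illustration, and the second row deliberately has a space after the colon.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical standalone example; names and sample rows are illustrative only.
object RlikeFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rlike-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two trimmed-down rows in the shape of the question's data.
    val df = Seq(
      ("Emma", "Smith",
        """{"23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}}"""),
      ("Joe", "Williams",
        """{"2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive": "N"}}""")
    ).toDF("FirstName", "LastName", "FamilyDetails")

    // \s* tolerates optional whitespace on either side of the colon.
    val inactive = df.filter(col("FamilyDetails").rlike("\"IsActive\"\\s*:\\s*\"N\""))

    inactive.show(truncate = false)
    spark.stop()
  }
}
```

This is still a text match rather than real JSON parsing, so it would also match if the same text happened to appear inside some other string value.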
For more rigorous parsing, you can use:
val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")
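If you prefer the typed Column API over a SQL expression string, the same idea can be sketched roughly as follows. This assumes Spark 3.0 or later (where the Scala `exists` function is available); the object name, the local SparkSession and the sample rows are hypothetical and only there to make the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, exists, from_json, map_values}
import org.apache.spark.sql.types.{MapType, StringType}

// Hypothetical standalone example; names and sample rows are illustrative only.
object JsonMapFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-map-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two trimmed-down rows in the shape of the question's data.
    val df = Seq(
      ("Emma", "Smith",
        """{"23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}}"""),
      ("Joe", "Williams",
        """{"2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},"2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}}""")
    ).toDF("FirstName", "LastName", "FamilyDetails")

    // Treat the unknown top-level ids as map keys, so no per-id schema is needed.
    val schema = MapType(StringType, MapType(StringType, StringType))

    // map_values drops the ids and leaves an array of per-member attribute maps.
    val members = map_values(from_json(col("FamilyDetails"), schema))

    // Keep a row if at least one member map has IsActive = "N".
    val inactive = df.filter(exists(members, m => m.getItem("IsActive") === "N"))

    inactive.show(truncate = false)
    spark.stop()
  }
}
```

With the sample rows above, this should keep only the Joe Williams row. Parsing into a map sidesteps the problem from the question, where schema inference turns every id such as 2321463.2321463 into its own struct field.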