I am learning Spark and building a sample project. I have a Spark DataFrame that looks like this when I save it to a JSON file:
{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}
JSONInput.show() gave me something like the below:
+---------------------------------------------------------------------+
|str |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}] |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}] |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}] |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}] |
+---------------------------------------------------------------------+
I know this is not the correct syntax for JSON, but this is what I have.
How can I first get this into a relational structure like the one below? (I am pretty new to JSON and Spark, so this part is not mandatory.)
Name    Age    Location
-----------------------
Chris   29     USA
John    20     France
Mark    30     UK
Mary    22     UK
And I want to filter for the specific country:
val resultToReturn = JSONInput.filter("Location=USA")
But this results in the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve Location given input columns: [str]; line 1 pos 0;
How do I get rid of "str" and make the data in a proper JSON structure? Can anyone help?
You can use from_json to parse the string values:
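import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Each JSON string maps an arbitrary ID key to a struct with three string fields.
val schema = MapType(
  StringType,
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("Age", StringType, true),
    StructField("Location", StringType, true)
  )),
  true
)

// str(1) is the JSON string; explode turns the parsed map into (key, value) rows.
val resultToReturn = JSONInput.select(
  explode(from_json(col("str")(1), schema))
).select("value.*")

resultToReturn.show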
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29|     USA|
//| John| 20|  France|
//| Mark| 30|      UK|
//| Mary| 22|      UK|
//+-----+---+--------+
Then you can filter:
resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29|     USA|
//+-----+---+--------+
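If you want to try the from_json approach without the original file, here is a minimal self-contained sketch. The object name FromJsonSketch, the helper parsePeople, the local SparkSession settings, and the two hand-built sample rows are all assumptions for illustration; it only assumes that str is an array<string> column whose second element holds the JSON text, as the show() output in the question suggests.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

object FromJsonSketch {
  // Inner struct shared by every record.
  val personSchema: StructType = StructType(Seq(
    StructField("Name", StringType),
    StructField("Age", StringType),
    StructField("Location", StringType)
  ))

  // Hypothetical helper: parse element 1 of "str" as a map keyed by the
  // changing ID, explode it into (key, value) rows, then flatten the value.
  def parsePeople(input: DataFrame): DataFrame =
    input
      .select(explode(from_json(col("str")(1), MapType(StringType, personSchema))))
      .select("value.*")

  def main(args: Array[String]): Unit = {
    // Local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for JSONInput: a single array<string> column named "str".
    val JSONInput = Seq(
      Seq("1001", """{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}"""),
      Seq("1002", """{"19035005":{"Name":"John","Age":"20","Location":"France"}}""")
    ).toDF("str")

    parsePeople(JSONInput).show()
    spark.stop()
  }
}
```

A MapType schema is used because the outer key (19035004, 19035005, ...) differs on every row, so it cannot be declared as a fixed struct field.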
You can extract the innermost JSON using regexp_extract and parse that JSON using from_json. Then you can star-expand the extracted JSON struct.
// str[1] is the JSON string; str[0] is only the leading ID such as "1001".
val parsed_df = JSONInput.selectExpr("""
  from_json(
    regexp_extract(str[1], '(\\{[^{}]+\\})', 1),
    'Name string, Age string, Location string'
  ) as parsed
""").select("parsed.*")
parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA     |
|John |20 |France  |
|Mark |30 |UK      |
|Mary |22 |UK      |
+-----+---+--------+
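To see what the pattern picks out, here is a small plain-Scala sketch that needs no Spark. The names RegexSketch and extractInnermost are made up for illustration; the helper imitates regexp_extract by returning an empty string when nothing matches. The pattern matches a brace pair with no other braces inside, so the outer object is skipped and only the inner one is returned.

```scala
object RegexSketch {
  // Same pattern as in the selectExpr call, written as a Scala regex.
  private val innermost = """(\{[^{}]+\})""".r

  // Hypothetical helper: first capture group of the first match, or "" if none.
  def extractInnermost(s: String): String =
    innermost.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val json = """{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}"""
    println(extractInnermost(json))   // {"Name":"Chris","Age":"29","Location":"USA"}
    println(extractInnermost("1001")) // empty line: no braces, so no match
  }
}
```

Note that this trick relies on the inner object being flat. If the innermost object ever contained another nested object, the pattern would return that deeper one instead.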
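You can then filter it with:
val filtered = parsed_df.filter("Location = 'USA'")
PS: remember to put single quotes around USA. Without them, Spark reads USA as a column name rather than a string literal, and the filter fails to resolve it.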
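If the quoting feels error-prone, the Column API is another option, since the value is then an ordinary Scala string. A minimal sketch follows; FilterSketch, usaOnlySql, usaOnlyColumn, and the tiny hand-built DataFrame are hypothetical stand-ins for the parsed result above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterSketch {
  // SQL expression string: the literal must be wrapped in single quotes.
  def usaOnlySql(df: DataFrame): DataFrame =
    df.filter("Location = 'USA'")

  // Column API: the right-hand side is a Scala value, so no SQL quoting is needed.
  def usaOnlyColumn(df: DataFrame): DataFrame =
    df.filter(col("Location") === "USA")

  def main(args: Array[String]): Unit = {
    // Local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the parsed DataFrame.
    val parsed = Seq(
      ("Chris", "29", "USA"),
      ("John", "20", "France")
    ).toDF("Name", "Age", "Location")

    usaOnlySql(parsed).show()
    usaOnlyColumn(parsed).show()
    spark.stop()
  }
}
```

Both calls should select the same rows; which one to use is mostly a matter of taste.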