
Reading JSON into a Spark DataFrame

I am learning Spark and I was building a sample project. I have a Spark DataFrame which, when saved to a JSON file, produced the following:

{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}

JSONInput.show() gives me something like the following:

+---------------------------------------------------------------------+
|str                                                                  |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}]     |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}]   |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}]       |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}]       |
+---------------------------------------------------------------------+

I know this is not the correct syntax for JSON, but this is what I have.
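For concreteness, each stored row above seems to behave like a two-element array of strings, where element 0 is the outer id and element 1 is a JSON string. A plain-Scala sketch of that shape (no Spark required; values copied from the first row):

```scala
// One row of the "str" column, modelled as a plain Scala array:
// element 0 is the outer id, element 1 is the JSON string that
// maps an inner id to the person record.
val str = Array(
  "1001",
  """{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}"""
)

println(str(0)) // prints 1001
println(str(1)) // prints the nested JSON string
```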

How can I get this into a relational structure like the one below? (I am pretty new to JSON and Spark, so this part is not mandatory.)

Name    Age   Location
-----------------------
Chris   29    USA
John    20    France
Mark    30    UK
Mary    22    UK

And I want to filter on a specific country:

val resultToReturn = JSONInput.filter("Location=USA")

But this results in the error below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve Location given input columns: [str]; line 1 pos 0;

How do I get rid of "str" and get the data into a proper relational structure? Can anyone help?

You can use from_json to parse the string values:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Each JSON string maps an id (e.g. "19035004") to a person struct
val schema = MapType(
  StringType,
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("Age", StringType, true),
    StructField("Location", StringType, true)
  )),
  true
)

// str(1) holds the JSON string; explode the parsed map into
// key/value columns, then star-expand the value struct
val resultToReturn = JSONInput.select(
    explode(from_json(col("str")(1), schema))
  ).select("value.*")

resultToReturn.show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29|     USA|
//| John| 20|  France|
//| Mark| 30|      UK|
//| Mary| 22|      UK|
//+-----+---+--------+

Then you can filter:

resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29|     USA|
//+-----+---+--------+

You can extract the innermost JSON using regexp_extract and parse it with from_json. Then you can star-expand the resulting struct.

// the JSON string sits at index 1 of the array (index 0 is the id)
val parsed_df = JSONInput.selectExpr("""
    from_json(
        regexp_extract(str[1], '(\\{[^{}]+\\})', 1),
        'Name string, Age string, Location string'
    ) as parsed
""").select("parsed.*")

parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA     |
|John |20 |France  |
|Mark |30 |UK      |
|Mary |22 |UK      |
+-----+---+--------+

And you can filter it with:

val filtered = parsed_df.filter("Location = 'USA'")

P.S. Remember to put single quotes around USA.
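To see what that regular expression captures, here is a plain-Scala sketch (no Spark required) applying the same pattern to one of the stored JSON strings:

```scala
// Plain-Scala check of the pattern used in regexp_extract above:
// \{[^{}]+\} matches the innermost {...} block, i.e. the first
// brace pair with no nested braces inside it.
val stored = """{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}"""
val pattern = """(\{[^{}]+\})""".r

// findFirstIn returns the first innermost block it can match
val innermost = pattern.findFirstIn(stored)
println(innermost.getOrElse("no match"))
// prints {"Name":"Chris","Age":"29","Location":"USA"}
```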
