
Compare two Json files using Apache Spark

I am new to Apache Spark and am trying to compare two JSON files. I need to find out which keys/values were added, removed, or modified, and the path of each.

To explain my problem, here is the code I have tried, along with a small JSON sample.

Sample JSON 1:

{
  "employee": {
    "name": "sonoo",
    "salary": 57000,
    "married": true
  }
}

Sample JSON 2:

{
  "employee": {
    "name": "sonoo",
    "salary": 58000,
    "married": true
  }
}

My code is:

//Compare two multiline json files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Load first json file
val jsonData_1 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_1.json").values)

//Load second json file
val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_2.json").values)
//Compare both json files
jsonData_2.except(jsonData_1).show(false)

The output I get when I run this code is:

+--------------------+
|employee            |
+--------------------+
|{true, sonoo, 58000}|
+--------------------+

But only one field, salary, was modified, so the output should contain just the updated field and its path.

The expected output is:

[
  {
    "op" : "replace",
    "path" : "/employee/salary",
    "value" : 58000
  }
]

Can anyone point me in the right direction?

Assuming each JSON document has an identifier, and that you have two groups of JSON files (e.g. two folders) to compare against each other:

  1. Load the JSON files from each group into a dataframe, providing a schema that matches the structure of the JSON. This gives you two dataframes with identical columns.
  2. Compare the JSON documents (now rows in the dataframes) by joining on the identifier and looking for mismatched values.
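
The two steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the `id` field is hypothetical (the sample in the question has no identifier), the names `JsonDiffSketch`, `leafPaths` and `diff` are made up for this example, and the "op"/"path"/"value" columns merely imitate the expected output from the question. Because both sides are read with the same fixed schema, a missing key and an explicit null look the same, so "add" and "remove" here really mean "null on one side". The inner join also ignores documents that exist in only one group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, not, when}
import org.apache.spark.sql.types._

object JsonDiffSketch {
  // Assumed schema: the question's sample plus a hypothetical "id" field.
  val schema: StructType = new StructType()
    .add("id", LongType)
    .add("employee", new StructType()
      .add("name", StringType)
      .add("salary", LongType)
      .add("married", BooleanType))

  // Dotted paths of all leaf (non-struct) fields, e.g. "employee.salary".
  def leafPaths(s: StructType, prefix: String = ""): Seq[String] =
    s.fields.toSeq.flatMap { f =>
      f.dataType match {
        case nested: StructType => leafPaths(nested, s"$prefix${f.name}.")
        case _                  => Seq(s"$prefix${f.name}")
      }
    }

  // Step 2: join on the identifier and emit one row per mismatched leaf value.
  def diff(left: DataFrame, right: DataFrame): DataFrame = {
    val joined = left.as("l").join(right.as("r"), col("l.id") === col("r.id"))
    leafPaths(schema).filter(_ != "id").map { path =>
      val l = col(s"l.$path")
      val r = col(s"r.$path")
      joined
        .filter(not(l <=> r)) // <=> is the null-safe equality operator
        .select(
          col("l.id").as("id"),
          when(l.isNull, "add").when(r.isNull, "remove").otherwise("replace").as("op"),
          lit("/" + path.replace('.', '/')).as("path"),
          r.cast("string").as("value")) // cast so every path shares one column type
    }.reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-diff").getOrCreate()
    import spark.implicits._

    // Step 1: load each group with the same explicit schema.
    // Inline strings stand in for the two folders of JSON files.
    val before = spark.read.schema(schema).json(
      Seq("""{"id":1,"employee":{"name":"sonoo","salary":57000,"married":true}}""").toDS())
    val after = spark.read.schema(schema).json(
      Seq("""{"id":1,"employee":{"name":"sonoo","salary":58000,"married":true}}""").toDS())

    diff(before, after).show(false)
    spark.stop()
  }
}
```

For the sample data, this should report a single row: op "replace", path "/employee/salary", value "58000". Calling `toJSON` on the result would give one JSON string per change if a format closer to the expected output is needed.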
