
How to parse the wiki infobox JSON with Scala Spark

I am trying to extract data from the JSON returned by the Wikipedia API:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0

I was able to print its schema:

scala> data.printSchema
root
 |-- batchcomplete: string (nullable = true)
 |-- query: struct (nullable = true)
 |    |-- pages: struct (nullable = true)
 |    |    |-- 28597189: struct (nullable = true)
 |    |    |    |-- ns: long (nullable = true)
 |    |    |    |-- pageid: long (nullable = true)
 |    |    |    |-- revisions: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- *: string (nullable = true)    
 |    |    |    |    |    |-- contentformat: string (nullable = true)
 |    |    |    |    |    |-- contentmodel: string (nullable = true)
 |    |    |    |-- title: string (nullable = true)

I want to extract the value stored under the key "*", that is, the field |-- *: string (nullable = true). Please suggest a solution.

One problem is

pages: struct (nullable = true)
     |    |    |-- 28597189: struct (nullable = true)

the number 28597189 is the page ID, so it is different for every title.

First, read the key (28597189) dynamically from the DataFrame schema, then use it to extract the data from the Spark DataFrame:

val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0)
println(s"Key Name : $keyName")

This gives you the key dynamically:

Key Name : 28597189

Then use the key to extract the data:

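// explode is not in scope by default; it comes from the functions object
import org.apache.spark.sql.functions.explode

// One row per revision, then flatten the revision struct into top-level columns
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*")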
revDf.printSchema()

Output:

root
 |-- *: string (nullable = true)
 |-- contentformat: string (nullable = true)
 |-- contentmodel: string (nullable = true)

Next, rename the column * to something easier to reference, such as star_column:

revDf = revDf.withColumnRenamed("*", "star_column")
revDf.printSchema()

Output:

root
 |-- star_column: string (nullable = true)
 |-- contentformat: string (nullable = true)
 |-- contentmodel: string (nullable = true)

Once we have the final DataFrame, call show:

revDf.show()

Output:

+--------------------+-------------+------------+
|         star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...|  text/x-wiki|    wikitext|
+--------------------+-------------+------------+
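The steps above can be combined into one small helper. This is only a sketch: it assumes Spark 2.2 or later (for reading JSON from a Dataset[String]) and a response that contains a single page. The names WikiRevisionText, revisionText, sample and wikitext are made up for illustration, and the sample JSON is a shortened stand-in for the real API response, not actual page content.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object WikiRevisionText {
  // Shortened, made-up stand-in for the API response (same shape as the schema above).
  val sample: String =
    """{"batchcomplete":"","query":{"pages":{"28597189":{"pageid":28597189,"ns":0,"title":"Rajanna","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"{{Infobox film}}"}]}}}}"""

  // Returns the wikitext of every revision in a single-page API response.
  def revisionText(spark: SparkSession, json: String): Seq[String] = {
    import spark.implicits._
    val response = spark.read.json(Seq(json).toDS())

    // The only field under query.pages is the page ID, which differs per title.
    val pageId = response.selectExpr("query.pages.*").schema.fieldNames.head

    response
      .select(explode(response(s"query.pages.$pageId.revisions")).as("revision"))
      .select("revision.*")               // columns: *, contentformat, contentmodel
      .withColumnRenamed("*", "wikitext") // "*" is awkward to reference later
      .select("wikitext")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("wiki-revision-text")
      .getOrCreate()
    try revisionText(spark, sample).foreach(println)
    finally spark.stop()
  }
}
```

With a real response, pass the JSON body fetched from the API URL instead of sample. The returned strings are raw wikitext, so the infobox itself still has to be parsed out of them separately.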
