
Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

I have a text file with the structure below.

(employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duration, Role}]})

For example:

(123456, Employee1, {"ProjectDetails": [
    {"ProjectName": "Web Development", "Description": "Online Sales website", "Duration": "6 Months", "Role": "Developer"},
    {"ProjectName": "Spark Development", "Description": "Online Sales Analysis", "Duration": "6 Months", "Role": "Data Engineer"},
    {"ProjectName": "Scala Training", "Description": "Training", "Duration": "1 Month"}
]})

Could someone help me parse and flatten the record into the dataframe below using Scala?

employeeID, Name, ProjectName, Description, Duration, Role
123456, Employee1, Web Development, Online Sales website, 6 Months, Developer
123456, Employee1, Spark Development, Online Sales Analysis, 6 Months, Data Engineer
123456, Employee1, Scala Training, Training, 1 Month, null

You can try this. I slightly modified the input structure, since the first two columns were not in JSON format.
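The modified input file itself is not shown. As a rough sketch of what that modification could look like, the snippet below rewrites one record from the question into a single JSON object on one line, which is the layout Spark's JSON reader expects. The object `RecordToJson`, the helper `toJsonLine`, and the regex are hypothetical names introduced for illustration; the sketch assumes the name contains no comma or quote and that the JSON part is already valid (straight quotes, commas between array elements).

```scala
// Hypothetical helper: turn "(id, name, {json})" into one JSON object per line.
// Assumes the name has no comma or quote and the JSON part is already valid JSON.
object RecordToJson {
  // Captures the id, the name, and everything between the first '{' and the last '}'.
  // The closing ')' is optional because the sample record may omit it.
  private val Record =
    """(?s)\(\s*(\d+)\s*,\s*([^,]+?)\s*,\s*\{(.*)\}\s*\)?\s*""".r

  def toJsonLine(record: String): Option[String] = record match {
    case Record(id, name, body) =>
      // Collapse line breaks so the whole object fits on one line
      val inner = body.trim.replaceAll("\\s*\\n\\s*", " ")
      Some(s"""{"employeeID":$id,"Name":"$name",$inner}""")
    case _ => None // not in the expected "(id, name, {json})" shape
  }
}
```

With each record rewritten this way and saved to a file, the Spark shell session from the answer reads it as follows: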

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql._

    // Spark 1.x shell session: `sc` is the SparkContext provided by the shell
    val sqlSC = new org.apache.spark.sql.SQLContext(sc)
    import sqlSC.implicits._

    // Load the modified input (one JSON object per record)
    val emp_DF = sqlSC.jsonFile("file:///C:/Users/ABCD/Desktop/Examples/Spark/Mailing List/Employee_Nested_Projects.json")

    case class ProjectInfo(ProjectName: String, Description: String, Duration: String, Role: String)
    case class Project(employeeID: Int, Name: String, ProjectDetails: Seq[ProjectInfo])

    // Emit one row per element of the ProjectDetails array. Fields are looked up
    // by name rather than by position, since the inferred struct's field order
    // may differ from the order in the file.
    val emp_projects_DF = emp_DF.explode(emp_DF("ProjectDetails")) {
      case Row(x: Seq[Row]) => x.map(p => ProjectInfo(
        p.getAs[String]("ProjectName"), p.getAs[String]("Description"),
        p.getAs[String]("Duration"), p.getAs[String]("Role")))
    }

    emp_projects_DF.select($"employeeID", $"Name", $"ProjectName", $"Description", $"Duration", $"Role").show()
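The session above relies on Spark 1.x APIs (`SQLContext.jsonFile` and `DataFrame.explode`). On Spark 2.2 or later, the same flattening might be written with `SparkSession`, `read.json`, and the `explode` function from `org.apache.spark.sql.functions`. The following is a minimal sketch under those assumptions, not the answer author's code: the object name `FlattenProjects`, the in-memory sample record, and the local master setting are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenProjects {
  // One output row per element of the ProjectDetails array.
  def flatten(emp: DataFrame): DataFrame =
    emp
      .select(col("employeeID"), col("Name"), explode(col("ProjectDetails")).as("p"))
      .select(col("employeeID"), col("Name"),
        col("p.ProjectName"), col("p.Description"), col("p.Duration"), col("p.Role"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlattenProjects")
      .master("local[*]") // assumption: local test run
      .getOrCreate()
    import spark.implicits._

    // Assumed sample in the modified layout: one JSON object per line.
    val lines = Seq(
      """{"employeeID":123456,"Name":"Employee1","ProjectDetails":[""" +
      """{"ProjectName":"Web Development","Description":"Online Sales website","Duration":"6 Months","Role":"Developer"},""" +
      """{"ProjectName":"Scala Training","Description":"Training","Duration":"1 Month"}]}"""
    ).toDS()

    // For a file, spark.read.json(path) expects the same line-per-object layout.
    val emp = spark.read.json(lines)
    flatten(emp).show(false)
    spark.stop()
  }
}
```

Because the schema is inferred across all array elements, a project without a `Role` key comes out with `null` in that column, as long as at least one project in the input carries the key.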


 