简体   繁体   中英

Loading json data into hive using spark sql

I am Unable to push json data into hive Below is the sample json data and my work . Please suggest me the missing one

json Data

    {
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
},
{
"userId":"thanks",
"jobTitleName":"Program Directory",
"firstName":"Tom",
"lastName":"Hanks",
"preferredFullName":"Tom Hanks",
"employeeCode":"E3",
"region":"CA",
"phoneNumber":"408-2222222",
"emailAddress":"tomhanks@gmail.com"
}
]
}

I tried to use sqlcontext and jsonFile method to load which is failing to parse the json

val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show 

error is :  java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)

I tried in different way and able to crack and get the schema

val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")        
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values

I am able to get the data and schema seperately for each record but I am missing an idea to get all the data and load into hive.

Any help or suggestion is much appreaciated.

Here is the Cracked answer

val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
val hc=new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf=hc.jsonRDD(jsonData)
val dfemp=fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df=sql("select * from empdummy")
df.select ("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")

Any suggestion for optimising the above code.

SparkSQL only supports reading JSON files when the file contains one JSON object per line .

SQLContext.scala

  /**
   * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
   * It goes through the entire dataset once to determine the schema.
   *
   * @group specificdata
   * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
   */
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
    read.json(path)
  }

Your file should look like this (strictly speaking, it's not a proper JSON file)

{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"} 
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}

Please have a look at the outstanding JIRA issue . Don't think it is that much of priority, but just for record.

You have two options

  1. Convert your json data to the supported format, one object per line
  2. Have one file per JSON object - this will result in too many files.

Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json .

Examples from spark documentation

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM