
Loading JSON data into Hive using Spark SQL

I am unable to push JSON data into Hive. Below is the sample JSON data and my work so far. Please suggest what I am missing.

JSON data:

    {
      "Employees": [
        {
          "userId": "rirani",
          "jobTitleName": "Developer",
          "firstName": "Romin",
          "lastName": "Irani",
          "preferredFullName": "Romin Irani",
          "employeeCode": "E1",
          "region": "CA",
          "phoneNumber": "408-1234567",
          "emailAddress": "romin.k.irani@gmail.com"
        },
        {
          "userId": "nirani",
          "jobTitleName": "Developer",
          "firstName": "Neil",
          "lastName": "Irani",
          "preferredFullName": "Neil Irani",
          "employeeCode": "E2",
          "region": "CA",
          "phoneNumber": "408-1111111",
          "emailAddress": "neilrirani@gmail.com"
        },
        {
          "userId": "thanks",
          "jobTitleName": "Program Directory",
          "firstName": "Tom",
          "lastName": "Hanks",
          "preferredFullName": "Tom Hanks",
          "employeeCode": "E3",
          "region": "CA",
          "phoneNumber": "408-2222222",
          "emailAddress": "tomhanks@gmail.com"
        }
      ]
    }

I tried to use the SQLContext and its jsonFile method to load the file, but it fails to parse the JSON:

val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show 

The error is: java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)

I tried a different way and was able to parse the file and get the schema:

val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")        
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values

I am able to get the data and schema separately for each record, but I am missing an idea for how to get all the data and load it into Hive.

Any help or suggestion is much appreciated.

Here is the working answer I came up with:

val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{col, explode}  // needed for explode and col below
val hc = new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf = hc.jsonRDD(jsonData)
val dfemp = fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df = hc.sql("select * from empdummy")
df.select("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")

Any suggestions for optimising the above code?
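One possible tidy-up (a sketch, not tested against this exact Spark version; the exploded column may be named "col" or "_c0" depending on the release) is to alias the exploded struct and flatten it with a star expansion in a single pass, which avoids the intermediate empdummy table and the second jsonRDD call:

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{col, explode}

val hc = new HiveContext(sc)
val jsonData = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json").map(_._2)

hc.jsonRDD(jsonData)
  .select(explode(col("Employees")).as("e")) // alias the exploded struct column
  .select("e.*")                             // flatten every nested field at once
  .saveAsTable("dummytab")                   // single write, no intermediate table
```

The `select("e.*")` struct expansion saves listing each nested field by hand, but it requires a reasonably recent Spark 1.x release.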

Spark SQL only supports reading JSON files when the file contains one JSON object per line.

SQLContext.scala

  /**
   * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
   * It goes through the entire dataset once to determine the schema.
   *
   * @group specificdata
   * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
   */
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
    read.json(path)
  }

Your file should look like this (strictly speaking, it's not a proper JSON file):

{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"} 
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}

Please have a look at the outstanding JIRA issue. It doesn't seem to be a high priority, but it's worth noting for the record.

You have two options:

  1. Convert your JSON data to the supported format, one object per line.
  2. Have one file per JSON object - this will result in too many files.
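For option 1, the wholeTextFiles trick already used in the question can perform the conversion once, rewriting the data in the supported one-object-per-line layout (a sketch; the output path is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Read the pretty-printed file as a single record, explode the array,
// then write each employee back out as one JSON object per line.
val whole = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json").map(_._2)
sqlc.jsonRDD(whole)
  .select(explode(col("Employees")).as("e"))
  .select("e.*")
  .toJSON                                    // RDD[String], one JSON object per line
  .saveAsTextFile("file:///home/vm/Downloads/emp_lines")
```

The resulting emp_lines directory can then be read directly with the standard JSON reader.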

Note that SQLContext.jsonFile is deprecated; use SQLContext.read.json instead.
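If upgrading is an option, Spark 2.2+ adds a multiLine option to the JSON reader, so a pretty-printed file like this one can be parsed directly (a sketch using the Spark 2.x SparkSession API):

```scala
import org.apache.spark.sql.functions.{col, explode}

val df = spark.read
  .option("multiLine", true)                 // allow one JSON document spanning many lines
  .json("file:///home/vm/Downloads/emp.json")

df.select(explode(col("Employees")).as("e"))
  .select("e.*")                             // flatten the nested employee fields
  .show()
```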

See the examples in the Spark documentation.
