I am trying to insert a dataframe into a Hive table using the following code:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val empfile = sc.textFile("empfile")
val empdata = empfile.map(p => p.split(","))
case class empc(id:Int, name:String, salary:Int, dept:String, location:String)
val empRDD = empdata.map(p => empc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.registerTempTable("emptab")
I have a table in Hive with the following DDL:
# col_name      data_type   comment
id              int
name            string
salary          int
dept            string

# Partition Information
# col_name      data_type   comment
location        string
I'm trying to insert the temporary table into the hive table as follows:
hiveCont.sql("insert into parttab select id, name, salary, dept from emptab")
This is giving an exception:
org.apache.spark.sql.AnalysisException: Table not found: emptab
Here 'emptab' is the temp table created from the DataFrame. I understand that the HiveContext runs the query against Hive from Spark, and since it doesn't find the table there, it throws this exception. But I don't understand how I can fix this issue. Could anyone tell me how to fix it?
registerTempTable("emptab")
: This line of code creates a temporary table in Spark, not in Hive. To store data in Hive, you have to create the table in Hive explicitly first (see the DDL sketch after the next code block). To save the DataFrame's data as a Hive table, use the code below:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val empfile = sc.textFile("empfile")
val empdata = empfile.map(p => p.split(","))
case class empc(id:Int, name:String, salary:Int, dept:String, location:String)
val empRDD = empdata.map(p => empc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.write.saveAsTable("emptab")
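If you instead want to keep the partitioned table from your question, a minimal sketch of creating it explicitly through the HiveContext could look like this (the table name parttab, the columns, and the location partition column are taken from the DDL in the question; the delimited row format is only an assumption for CSV-style data):
// create the partitioned Hive table explicitly before inserting into it;
// columns and partition column follow the DDL shown in the question
hiveCont.sql("""
  CREATE TABLE IF NOT EXISTS parttab (
    id INT,
    name STRING,
    salary INT,
    dept STRING
  )
  PARTITIONED BY (location STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")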
You are implicitly converting the RDD into a DataFrame, but you are not importing the implicit conversions, so the RDD is not converted into a DataFrame. Include the line below in your imports (the implicits should come from the same SQLContext/HiveContext you use to run your queries):
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
Also, case classes must be defined at the top level; they cannot be nested. So your final code should look like this (a sketch of the actual insert into the partitioned table follows after it):
import org.apache.spark._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
// import the implicits from the same context that will run the queries,
// so that toDF() and the registered temp table are bound to it
import hiveCont.implicits._
case class Empc(id:Int, name:String, salary:Int, dept:String, location:String)
val empFile = sc.textFile("/hdfs/location/of/data/")
val empData = empFile.map(p => p.split(","))
val empRDD = empData.map(p => Empc(p(0).trim.toInt, p(1), p(2).trim.toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.registerTempTable("emptab")
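With the temp table registered on the same HiveContext, the insert from your question should then work once the partition column is included and dynamic partitioning is enabled. A minimal sketch, assuming the partitioned table parttab from the question already exists in Hive:
// enable dynamic partitioning so Hive derives the 'location' partition
// from the selected column instead of a hard-coded value
hiveCont.sql("SET hive.exec.dynamic.partition=true")
hiveCont.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
// the partition column must be selected last
hiveCont.sql("INSERT INTO TABLE parttab PARTITION (location) SELECT id, name, salary, dept, location FROM emptab")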
Also trim any whitespace when you convert a String to an Integer; I have included that in the code above as well.
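For example, a minimal illustration of why the trim matters when the input contains spaces:
" 3000".toInt        // throws java.lang.NumberFormatException because of the leading space
" 3000".trim.toInt   // removes the whitespace first and parses successfully: 3000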