
How to create an RDD from an RCFile using data that is partitioned in a Hive table

CREATE TABLE employee_details (
  emp_first_name varchar(50),
  emp_last_name varchar(50),
  emp_dept varchar(50)
)
PARTITIONED BY (
  emp_doj varchar(50),
  emp_dept_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

The Hive table's data is stored at /data/warehouse/employee_details.
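For context, Hive stores each partition as its own subdirectory under the table location, one directory level per partition column. The partition values live only in these paths, not inside the RC files, which hold just the three non-partition columns. The layout looks roughly like this (the dates and IDs below are made-up examples):

/data/warehouse/employee_details/
  emp_doj=2016-01-01/
    emp_dept_id=10/
      000000_0    <- RC file containing emp_first_name, emp_last_name, emp_dept
    emp_dept_id=20/
      000000_0
  emp_doj=2016-02-01/
    emp_dept_id=10/
      000000_0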

I have a Hive table, employee_details, loaded with data; it is partitioned by emp_doj and emp_dept_id, and its file format is RCFile.

I would like to process the data in the table using Spark SQL without using HiveContext (simply using SQLContext).

Could you please help me with how to load the partitioned data of the Hive table into an RDD and convert it to a DataFrame?
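One way to do this without any Hive support is to read the RC files directly with Hadoop's RCFileInputFormat and assemble the DataFrame yourself. Below is a minimal sketch of that approach; the partition path, the Employee case class, and the assumption that all three columns decode as UTF-8 text are illustrative, not from the original post. Because the partition values exist only in the directory path, you would have to add emp_doj and emp_dept_id back yourself if you need them as columns.

import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
import org.apache.hadoop.io.{LongWritable, Text}

// Hypothetical case class for the three non-partition columns
case class Employee(empFirstName: String, empLastName: String, empDept: String)

// Read one partition directory directly (example path, made-up partition values)
val path = "/data/warehouse/employee_details/emp_doj=2016-01-01/emp_dept_id=10"

val raw = sc.hadoopFile(
  path,
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable],
  classOf[BytesRefArrayWritable])

// Each value holds one row of column references; decode every field as UTF-8 text
val employees = raw.values.map { row =>
  def field(i: Int): String = {
    val ref = row.get(i)
    Text.decode(ref.getData, ref.getStart, ref.getLength)
  }
  Employee(field(0), field(1), field(2))
}

import sqlContext.implicits._
val employeeDF = employees.toDF()
employeeDF.show()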

If you are using Spark 2.0, you can do it this way. In Spark 2.0, the SparkSession entry point with enableHiveSupport() replaces both SQLContext and HiveContext, so you do not need to create a HiveContext yourself.

import org.apache.spark.sql.SparkSession

// Warehouse root taken from the question; adjust for your environment
val warehouseLocation = "/data/warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Queries are expressed in HiveQL; the result comes back as a DataFrame
sql("SELECT * FROM employee_details").show()
