
Cannot create a dataframe in pyspark and write it to Hive table

I am trying to create a DataFrame in PySpark, write it out as a Hive table, and then read it back, but it does not work as expected:

from pyspark.sql import HiveContext, Row  # sc is the SparkContext provided by the pyspark shell

sqlContext = HiveContext(sc)
hive_context = HiveContext(sc)  # initialize Hive

# load the control table
cntl_dt = [('2016-04-30')]
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = sqlContext.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")
load_dt = hive_context.sql("select load_dt from schema.cntrl_tbl").first()['load_dt']
print(load_dt)

Prints: 2

I expect: 2016-04-30

This is because:

cntl_dt = [('2016-04-30')]

is not valid syntax for a single-element tuple. The parentheses are ignored, so the result is the same as:

['2016-04-30']

so each x is the whole string '2016-04-30', and

Row(load_dt=x[0])

takes its first character, giving:

Row(load_dt='2')
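
A quick check in a plain Python shell (no Spark involved) makes this concrete:

>>> x = ('2016-04-30')   # parentheses without a comma do nothing
>>> type(x)
<class 'str'>
>>> x[0]                 # indexing a string yields its first character
'2'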

Use:

cntl_dt = [('2016-04-30', )]
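
The trailing comma is what turns the parentheses into a one-element tuple, so x[0] now returns the whole date string. Equivalently (a sketch under the same setup as the question), keep plain strings and drop the indexing:

cntl_dt = ['2016-04-30']
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x))  # x is the whole string here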

Also, you're mixing different contexts (SQLContext and HiveContext), which is generally a bad idea, and neither should be used in any recent Spark version; since Spark 2.0, SparkSession replaces both.
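
For reference, here is a minimal sketch of the same pipeline on Spark 2.x or later, where a single Hive-enabled SparkSession replaces both contexts (the schema.cntrl_tbl table name is taken from the question; the app name is hypothetical):

from pyspark.sql import SparkSession

# One SparkSession with Hive support replaces SQLContext and HiveContext
spark = (SparkSession.builder
         .appName("control-table-demo")
         .enableHiveSupport()
         .getOrCreate())

# Note the trailing comma: [('2016-04-30',)] is a list containing one 1-tuple
df_cntl_dt = spark.createDataFrame([('2016-04-30',)], ['load_dt'])
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")

load_dt = spark.sql("select load_dt from schema.cntrl_tbl").first()['load_dt']
print(load_dt)  # 2016-04-30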
