
PySpark/HIVE: append to an existing table

Really basic PySpark/Hive question:

How do I append to an existing table? My attempt is below:

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf_init = SparkConf().setAppName('pyspark2')
sc = SparkContext(conf = conf_init)
hive_cxt = HiveContext(sc)

import pandas as pd
df = pd.DataFrame({'a':[0,0], 'b':[0,0]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('overwrite').saveAsTable('database.table') #this line works

df = pd.DataFrame({'a':[1,1,1], 'b':[2,2,2]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('append').saveAsTable('database.table') #this line does not work
#sdf.write.insertInto('database.table',overwrite = False) #this line does not work

Thanks! Sam

It seems that using mode('overwrite') was causing the problem; it drops the table and then recreates a new one. If I instead create the table up front and only ever append to it, everything works fine:

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf_init = SparkConf().setAppName('pyspark2')
sc = SparkContext(conf = conf_init)
hive_cxt = HiveContext(sc)
hive_cxt.sql('USE database')

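query = """
        CREATE TABLE IF NOT EXISTS table (a int, b int)
        STORED AS parquet
        """
hive_cxt.sql(query)  # create the table once, up front

import pandas as pd
df = pd.DataFrame({'a':[0,0], 'b':[0,0]})
sdf = hive_cxt.createDataFrame(df)
# append step; format('hive') assumes Spark 2.2+ and matches the Hive table created above
sdf.write.mode('append').format('hive').saveAsTable('table')

query = """
        SELECT *
        FROM   table
        """
df = hive_cxt.sql(query)
df = df.toPandas()
print(df) # successfully pull the data in table

df = pd.DataFrame({'a':[1,1,1], 'b':[2,2,2]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('append').format('hive').saveAsTable('table')  # appends; existing rows are kept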

