How to insert Spark DataFrame to Hive Internal table?

What's the right way to insert a DataFrame into a Hive internal table in append mode? It seems we can either write the DataFrame directly to Hive using the "saveAsTable" method, or store the DataFrame in a temp table and then use a query.

df.write().mode("append").saveAsTable("tableName")

OR

df.registerTempTable("temptable") 
sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable")

Will the second approach append the records or overwrite them?

Is there any other way to effectively write the DataFrame to a Hive internal table?

Neither of the options here worked for me; they have probably been deprecated since that answer was written.

According to the latest Spark API docs (for Spark 2.1), the way to do this is the insertInto() method from the DataFrameWriter class.

I'm using the Python PySpark API, but it would be the same in Scala:

df.write.insertInto("target_db.target_table", overwrite=False)

The above worked for me.
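For context, here is a minimal PySpark sketch of that flow. The session setup and sample data are illustrative assumptions, and target_db.target_table is assumed to already exist with a matching schema:

from pyspark.sql import SparkSession

# Hive support is needed so insertInto resolves tables in the Hive metastore
spark = SparkSession.builder \
    .appName("append-to-hive") \
    .enableHiveSupport() \
    .getOrCreate()

# Sample data for illustration only
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# insertInto matches columns by position, not by name, so the DataFrame's
# column order must line up with the table's schema
df.write.insertInto("target_db.target_table", overwrite=False)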

df.saveAsTable("tableName", "append") is deprecated. Instead, you should use the second approach.

sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable")

It will create the table if the table does not exist. When you run your code a second time, you need to drop the existing table, otherwise your code will exit with an exception.
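For example, a re-runnable sketch of that approach drops the table first (same table names as above):

sqlContext.sql("DROP TABLE IF EXISTS mytable")
sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable")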

Another approach, if you don't want to drop the table: create the table separately, then insert your data into that table, as sketched below.
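A minimal sketch of creating the table up front (the column names and types here are placeholders; they need to match temptable's schema):

sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable (col1 INT, col2 STRING)")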

The below code will append data into an existing table:

sqlContext.sql("insert into table mytable select * from temptable")

And the below code will overwrite the data in an existing table:

sqlContext.sql("insert overwrite table mytable select * from temptable")

This answer is based on Spark 1.6.2. In case you are using another version of Spark, I would suggest checking the appropriate documentation.

You could also overwrite just the partitions you are inserting into, using dynamic partitioning:

# Allow fully dynamic partition values (no static partition column required)
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

temp_table = "tmp_{}".format(table)
df.createOrReplaceTempView(temp_table)
# The dynamic partition columns (partCol1, partCol2) must come last in the
# select list, in the same order as in the partition clause
spark.sql("""
    insert overwrite table `{schema}`.`{table}`
    partition (partCol1, partCol2)
      select col1
           , col2
           , col3
           , col4
           , partCol1
           , partCol2
    from {temp_table}
""".format(schema=schema, table=table, temp_table=temp_table))
