
Insert Spark DataFrame into a partitioned Hive table without overwriting the data

I have a DataFrame created from a partitioned table.

I need to insert this DataFrame into an already created partitioned Hive table without overwriting the previous data.

I used partitionBy("columnname").insertInto("hivetable"), but it gives me an error saying that partitionBy and insertInto can't be used at the same time.

You can't use partitionBy with the insertInto operator. partitionBy splits the data being written into multiple Hive partitions, whereas insertInto is used to insert data into an already defined partition.

Therefore, you can do something like this:

// Create a table partitioned by p1, deriving the partition column from id
spark.range(10)
  .withColumn("p1", 'id % 2)
  .write
  .mode("overwrite")
  .partitionBy("p1")
  .saveAsTable("partitioned_table")

// Insert into one specific (static) partition of the existing table
val insertIntoQ = spark.sql("INSERT INTO TABLE partitioned_table PARTITION (p1 = 4) VALUES 41, 42")

If you require partitions to be added dynamically, then you need to set hive.exec.dynamic.partition:

hiveContext.setConf("hive.exec.dynamic.partition", "true")

hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

I faced a similar problem during data ingestion. I did something like:

df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")

When you use insertInto, there is no need to add partitionBy or bucketBy in the code. The partitioning should be defined in the table creation statement, as in the sketch below.
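A minimal sketch of that approach, assuming the hivetable and DataFrame df from the question, with hivetable already created as PARTITIONED BY and df's columns ordered so the partition column comes last (insertInto matches columns by position, not by name):

import org.apache.spark.sql.SaveMode

// No partitionBy or bucketBy here: the table's own definition
// determines how the rows are partitioned.
df.write
  .mode(SaveMode.Append)      // append instead of overwriting existing data
  .insertInto("hivetable")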
