
Insert a Spark DataFrame into a partitioned Hive table without overwriting the data

I have a DataFrame created from a partitioned table.

I need to insert this DataFrame into an already created partitioned Hive table without overwriting the previous data.

I used partitionBy("columnname").insertInto("hivetable"), but it gives me an error saying that partitionBy and insertInto can't be used at the same time.

You can't use partitionBy with the insertInto operator. partitionBy splits the data being written into multiple Hive partitions, while insertInto inserts data into a table whose partitioning is already defined.

Therefore, you can do something like this:

spark.range(10)
  .withColumn("p1", 'id % 2)
  .write
  .mode("overwrite")
  .partitionBy("p1")
  .saveAsTable("partitioned_table")

val insertIntoQ = sql("INSERT INTO TABLE partitioned_table PARTITION (p1 = 4) VALUES (41), (42)")

If you require partitions to be added dynamically, you need to set hive.exec.dynamic.partition:

hiveContext.setConf("hive.exec.dynamic.partition", "true")

hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

I faced a similar problem during data ingestion; I did something like:

df.write.mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")

When you use insertInto, there is no need to add partitionBy or bucketBy in the code. The partitioning and bucketing should already be defined in the table's creation statement.
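To tie the pieces together, here is a minimal sketch of the append-without-overwrite approach described above, combining the dynamic-partition settings with insertInto. The table name `hivetable` and the source table `source_table` are hypothetical placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// A sketch, assuming a Hive-enabled Spark session and an existing
// partitioned table named `hivetable` (both names are hypothetical).
val spark = SparkSession.builder()
  .appName("append-to-partitioned-hive-table")
  .enableHiveSupport()
  .getOrCreate()

// Allow partitions to be created on the fly during the insert.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// Hypothetical source of the rows to append.
val df = spark.table("source_table")

// Note: no partitionBy here. insertInto takes the partitioning from
// the table definition and matches columns by position, not by name,
// so df's column order must match the target table's schema, with
// partition columns last.
df.write
  .mode(SaveMode.Append)
  .insertInto("hivetable")
```

Because insertInto matches columns positionally, reordering the DataFrame's columns with select before writing is a common safeguard.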
