
Insert Spark DataFrame into a partitioned Hive table without overwriting the data

I have a DataFrame created from a partitioned table.

I need to insert this DataFrame into an already created partitioned Hive table without overwriting the previous data.

I used partitionBy("columnname").insertInto("hivetable"), but it gives me an error saying that partitionBy and insertInto can't be used at the same time.

You can't use partitionBy with the insertInto operator. partitionBy splits the data being written into multiple Hive partitions, whereas insertInto is used to insert data into an already defined partition.

Therefore, you can do something like this:

// Create a table partitioned by p1, deriving the partition column from id
spark.range(10)
  .withColumn("p1", 'id % 2)
  .write
  .mode("overwrite")
  .partitionBy("p1")
  .saveAsTable("partitioned_table")

// Insert into one specific (static) partition of the existing table
val insertIntoQ = spark.sql("INSERT INTO TABLE partitioned_table PARTITION (p1 = 4) VALUES 41, 42")

If you require partitions to be added dynamically, then you need to set hive.exec.dynamic.partition:

hiveContext.setConf("hive.exec.dynamic.partition", "true")

hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

I faced a similar problem during data ingestion. I did something like:

df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")

When you use insertInto, there is no need to add partitionBy or bucketBy in the code. The partitioning should be defined in the table creation statement, as in the sketch below.
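A minimal sketch of that approach, assuming the hivetable and DataFrame df from the question, with hivetable already created as PARTITIONED BY and df's columns ordered so the partition column comes last (insertInto matches columns by position, not by name):

import org.apache.spark.sql.SaveMode

// No partitionBy or bucketBy here: the table's own definition
// determines how the rows are partitioned.
df.write
  .mode(SaveMode.Append)      // append instead of overwriting existing data
  .insertInto("hivetable")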
