
Spark DataFrame issue when overwriting partition data of a Hive table

Below is my Hive table definition:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test2 (
  id INT,
  count INT
)
PARTITIONED BY (
  fac STRING,
  fiscaldate_str DATE
)
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';

I have the following data in the Hive table (I just inserted sample data):

select * from default.test2;

+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
|  2|    3| NRM|    2019-01-01|
|  1|    2| NRM|    2019-01-01|
|  2|    3| NRM|    2019-01-02|
|  1|    2| NRM|    2019-01-02|
|  2|    3| NRM|    2019-01-03|
|  1|    2| NRM|    2019-01-03|
|  2|    3|STST|    2019-01-01|
|  1|    2|STST|    2019-01-01|
|  2|    3|STST|    2019-01-02|
|  1|    2|STST|    2019-01-02|
|  2|    3|STST|    2019-01-03|
|  1|    2|STST|    2019-01-03|
+---+-----+----+--------------+

This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically execute INSERT OVERWRITE at the partition level using the Spark DataFrame writer.

However, when we try this, we either end up with duplicate data, or all the other partitions get deleted.

Below are the code snippets for this using Spark DataFrames.

First, I create the DataFrame:

df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])

df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99   |NRM|2019-01-01    |
|999|999  |NRM|2019-01-01    |
+---+-----+---+--------------+
  1. The snippet below produces duplicate data:

    df.coalesce(1).write.mode("overwrite").insertInto("default.test2")

  2. With the snippets below, all other data gets deleted and only the new data remains:

    df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")

OR

    df.createOrReplaceTempView("tempview")

    tbl_ald_kpiv_hist_insert = spark.sql("""
    INSERT OVERWRITE TABLE default.test2
    PARTITION (fac, fiscaldate_str)
    SELECT * FROM tempview
    """)

I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.

Does anyone have any idea why I am not able to dynamically overwrite the data in the partitions?

Your question is not easy to follow, but I think you mean you want a partition overwritten. If so, this is what you need, and all you need: the second line.

df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2", overwrite=True)

Note the overwrite=True. The comment made is neither here nor there, since the DataFrame writer is being used. I am not addressing the coalesce(1).
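The effect of overwrite=True together with the partition-overwrite mode can be modelled in plain Python. This is a toy sketch, not the Spark API: the "table" is a dict keyed by partition values, a "static" overwrite replaces the whole table, and a "dynamic" overwrite replaces only the partitions present in the incoming rows.

```python
# Toy model of partition-overwrite semantics (illustration only, not Spark).

def overwrite(table, new_rows, mode):
    """Group new_rows by partition key, then apply a 'static' or 'dynamic' overwrite."""
    new_parts = {}
    for row in new_rows:
        key = (row["fac"], row["fiscaldate_str"])
        new_parts.setdefault(key, []).append(row)
    if mode == "static":
        return new_parts               # every old partition is dropped
    updated = dict(table)              # 'dynamic': keep untouched partitions
    updated.update(new_parts)          # replace only the matching ones
    return updated

table = {
    ("NRM", "2019-01-01"):  [{"id": 1, "count": 2, "fac": "NRM",  "fiscaldate_str": "2019-01-01"}],
    ("STST", "2019-01-01"): [{"id": 1, "count": 2, "fac": "STST", "fiscaldate_str": "2019-01-01"}],
}
new = [{"id": 99, "count": 99, "fac": "NRM", "fiscaldate_str": "2019-01-01"}]

static_result = overwrite(table, new, "static")    # only the NRM partition survives
dynamic_result = overwrite(table, new, "dynamic")  # the STST partition is kept
```

This mirrors the two outcomes the asker saw: a full-table overwrite wipes the other partitions, while a dynamic overwrite touches only the partition being written.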

Comment to Asker

I ran this as I usually do when prototyping and answering here, on a Databricks notebook, and expressly set the following; it worked fine:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")

You asked me to update the answer with:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

I can do so, as I have just done; maybe this is needed in your environment, but I certainly did not need it.

UPDATE 19/3/20

This worked on prior Spark releases; now, as far as I can see, the following applies:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks, the settings below did not matter
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

Seq(("CompanyA1", "A"), ("CompanyA2", "A"), 
    ("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")

spark.sql(s"SELECT * FROM KQCAMS9").show(false)

val df = Seq(("CompanyA3", "A"))
  .toDF("company", "id")
// disregard the coalesce
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")

spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)

All OK this way from 2.4.x onwards.
