
Spark HiveContext : Insert Overwrite the same table it is read from

I want to implement SCD1 and SCD2 (slowly changing dimension types 1 and 2) using PySpark with HiveContext. My approach reads the incremental data and the target table, calls registerTempTable on every source DataFrame, and joins them to build the upserted result. When I try to write the final dataset back into the target table, the write fails because INSERT OVERWRITE is not allowed on a table that is also being read from.

Please suggest a solution for this. I do not want to write the intermediate data to a physical table and read it back again.

Is there any property, or some other way, to store the final dataset without keeping a dependency on the table it was read from? That might make it possible to overwrite the table.

Please suggest.

You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.

It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table; it can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be implemented efficiently without mutable storage, and is therefore not a good fit for Spark.
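To illustrate the "(mostly) append" point, here is a minimal pure-Python sketch of the SCD2 decision logic, not a Spark implementation. The function name `scd2_changes` and the columns `valid_from`, `valid_to`, and `is_current` are assumptions made for the example; they do not come from the question.

```python
from datetime import date


def scd2_changes(current_rows, incoming_rows, key, tracked, load_date):
    """Split an SCD2 load into keys whose open version must be closed
    and brand-new row versions that are simply appended.

    current_rows  -- the open (current) versions already in the target
    incoming_rows -- the incremental batch
    key           -- name of the business-key column
    tracked       -- columns whose change triggers a new version
    """
    current_by_key = {row[key]: row for row in current_rows}
    keys_to_close = []
    rows_to_append = []
    for new in incoming_rows:
        old = current_by_key.get(new[key])
        changed = old is not None and any(old[c] != new[c] for c in tracked)
        if old is None or changed:
            # New key or changed attributes: append a fresh open version.
            version = dict(new)
            version.update(valid_from=load_date, valid_to=None, is_current=True)
            rows_to_append.append(version)
        if changed:
            # The only non-append step: the previous version gets an end date.
            keys_to_close.append(new[key])
    return keys_to_close, rows_to_append
```

Unchanged rows produce no output at all, so history is never rewritten wholesale. Only the keys in the first list touch existing rows.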

While going through the Spark documentation, one property caught my attention and gave me an idea.

As my table was Parquet, I set this property to false so that the data is read through the Hive SerDe instead of Spark's built-in Parquet support.

hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

This solution is working fine for me.
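For context, the flow described in this answer could be sketched roughly as follows. The helper name `upsert_in_place` and the `build_final_df` callback are hypothetical, while `setConf`, `table`, and `write.insertInto` are the HiveContext and DataFrameWriter calls. The sketch assumes the property has to be set before the target table is read, and it does nothing to remove the failure risk described in the first answer.

```python
def upsert_in_place(hive_context, target_table, build_final_df):
    """Overwrite target_table with a result derived from target_table itself.

    build_final_df is a hypothetical callback: it receives the target
    DataFrame and returns the final (upserted) DataFrame.
    """
    # Read Parquet tables through the Hive SerDe rather than Spark's
    # built-in Parquet reader (assumption: set before the table is read).
    hive_context.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    target_df = hive_context.table(target_table)
    final_df = build_final_df(target_df)

    # insertInto matches columns by position, so final_df must have the
    # same column order as the target table.
    final_df.write.insertInto(target_table, overwrite=True)
```

If the job dies part-way through the final write, the table may be left partially overwritten, so a backup or a staging step is still worth considering.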

A DataFrame write will not let you INSERT OVERWRITE into the same location or table it reads from. You can use one of the options below to work around this.

  1. Run a Hive INSERT OVERWRITE query through the Spark HiveContext. The problem is failure handling: if the job fails, the data in that partition will be corrupted, so be very careful.
  2. Save the result into a temporary table and, once the job has finished, overwrite the target table from it.
  3. If you still want to do it programmatically, save the DataFrame to a temporary location and use HDFS I/O to move it to the target partition location.
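Option 3 could look something like the sketch below. It assumes the target is stored as Parquet, that the `hdfs` command-line client is on the PATH, and that `backup_path` does not exist beforehand. The function names `swap_commands` and `publish`, and all paths, are hypothetical.

```python
import subprocess


def swap_commands(staging_path, target_path, backup_path):
    """HDFS shell commands that replace target_path with staging_path."""
    return [
        # Move the old data aside first, so nothing is deleted until the
        # new data is in place. backup_path must not already exist, or
        # the target would be moved *inside* it.
        ["hdfs", "dfs", "-mv", target_path, backup_path],
        ["hdfs", "dfs", "-mv", staging_path, target_path],
        ["hdfs", "dfs", "-rm", "-r", backup_path],
    ]


def publish(final_df, staging_path, target_path, backup_path,
            run=subprocess.check_call):
    """Write final_df to a staging directory, then swap it into place."""
    # The write completes before the target location is touched, so a
    # failed Spark job leaves the original data intact.
    final_df.write.mode("overwrite").parquet(staging_path)
    for cmd in swap_commands(staging_path, target_path, backup_path):
        run(cmd)  # check_call raises if a command exits non-zero
```

The `run` parameter exists only so the commands can be dry-run or logged, for example `publish(df, staging, target, backup, run=print)`. After the swap, Spark or the metastore may still hold cached metadata for the table, so a refresh may be needed before reading it again.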
