
Spark HiveContext: Insert Overwrite the same table it is read from

I want to apply SCD1 and SCD2 using PySpark in a HiveContext. In my approach, I read the incremental data and the target table. After reading, I join them to implement the upsert. I call registerTempTable on all of the source DataFrames. When I try to write the final dataset into the target table, I hit the error that insert overwrite is not possible into the table it is read from.
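For context, here is a minimal sketch of the setup described above; the database, table, and column names are hypothetical, and the final line is the write that triggers the error:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="scd-upsert")
    hiveContext = HiveContext(sc)

    # Hypothetical tables: the incremental feed and the SCD target.
    incremental_df = hiveContext.table("staging_db.customer_incremental")
    target_df = hiveContext.table("target_db.customer_dim")

    incremental_df.registerTempTable("incremental")
    target_df.registerTempTable("target")

    # Upsert: take the incoming version of a key when present, otherwise keep the existing row.
    merged_df = hiveContext.sql("""
        SELECT COALESCE(i.customer_id, t.customer_id) AS customer_id,
               COALESCE(i.name, t.name)               AS name,
               COALESCE(i.address, t.address)         AS address
        FROM target t
        FULL OUTER JOIN incremental i
          ON t.customer_id = i.customer_id
    """)

    # This is the step that fails, because target_db.customer_dim is also a source of the plan.
    merged_df.write.insertInto("target_db.customer_dim", overwrite=True)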

Please suggest a solution for this. I do not want to write the intermediate data into a physical table and read it again.

Is there any property or way to store the final dataset without keeping the dependency on the table it was read from, so that it becomes possible to overwrite the table?

Please suggest.

You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.

It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table and can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be efficiently implemented without mutable storage, and is therefore not a good fit for Spark.
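To illustrate the append-style SCD2 point, a sketch along these lines is possible (all names are hypothetical; only new or changed versions are appended, each stamped with an effective_from timestamp, and the "current" version is derived at query time instead of updating old rows):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext(appName="scd2-append")
    hiveContext = HiveContext(sc)

    # Hypothetical append-only history table with columns:
    # customer_id, name, address, effective_from.
    history = hiveContext.table("target_db.customer_dim_history")
    incoming = hiveContext.table("staging_db.customer_incremental") \
                          .withColumn("effective_from", F.current_timestamp())

    # Latest known version per key.
    latest = history.groupBy("customer_id") \
                    .agg(F.max("effective_from").alias("effective_from")) \
                    .join(history, ["customer_id", "effective_from"])

    # Keep only brand-new keys or rows whose tracked attributes changed.
    changed = incoming.alias("i") \
        .join(latest.alias("l"), F.col("i.customer_id") == F.col("l.customer_id"), "left_outer") \
        .where(F.col("l.customer_id").isNull() | (F.col("i.address") != F.col("l.address"))) \
        .select("i.customer_id", "i.name", "i.address", "i.effective_from")

    # Append only -- the table that was read is never overwritten.
    changed.write.insertInto("target_db.customer_dim_history")

Which row is current can then be resolved at read time, for example with a window over effective_from per customer_id, which keeps the history table strictly append-only.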

I was going through the Spark documentation and an idea clicked when I was checking one property there.

As my table was Parquet, I used the Hive metastore SerDe to read the data by setting this property to false.

hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

This solution is working fine for me.
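Roughly how this workaround fits into the job (the table name is hypothetical, and the caveat from the first answer still applies: a failure during the write can still corrupt or lose data):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="convertMetastoreParquet-workaround")
    hiveContext = HiveContext(sc)

    # Read the Parquet-backed Hive table through the Hive SerDe
    # instead of Spark's built-in Parquet reader.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    target = hiveContext.table("target_db.customer_dim")
    # ... build the merged/upserted DataFrame from target and the incremental data ...
    merged = target  # placeholder for the real merge logic

    merged.write.insertInto("target_db.customer_dim", overwrite=True)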

A DataFrame will not allow an insert overwrite into the same location or the same table it was read from. You can use one of the options below to solve the problem.

  1. Run a Hive INSERT OVERWRITE query through spark/hiveContext. The problem is that if the job fails, the data in that partition will get corrupted, so be very careful.
  2. Save the result into a temp table and, once that job has finished, overwrite the target table from it (see the sketch after this list).
  3. If you still want to do it programmatically, you can save the DataFrame to a tmp location and use HDFS I/O to move the files to the target partition location.
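A minimal sketch of option 2, with hypothetical table names:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="temp-table-then-overwrite")
    hiveContext = HiveContext(sc)

    target = hiveContext.table("target_db.customer_dim")
    incremental = hiveContext.table("staging_db.customer_incremental")

    # ... real upsert/merge logic goes here; unionAll is only a placeholder ...
    merged = target.unionAll(incremental)

    # 1. Materialize the result into a staging table first.
    merged.write.mode("overwrite").saveAsTable("staging_db.customer_dim_tmp")

    # 2. Once that write has finished, nothing reads from the target anymore,
    #    so it can safely be overwritten from the staging table.
    hiveContext.sql("""
        INSERT OVERWRITE TABLE target_db.customer_dim
        SELECT * FROM staging_db.customer_dim_tmp
    """)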
