
How can you perform an insert overwrite using Spark?

I'm trying to transition one of our ETL Hive scripts to Spark, where the Hive ETL script maintains a table in which part of the data needs to be deleted every night before the new sync. The Hive ETL takes the main table and deletes data that is older than 3 days using insert overwrite: basically, it creates a temp table with the data that is no more than three days old and then overwrites the main table.

With Spark (using Scala) I keep getting an error saying that I cannot write to the same source. Here's my code:

spark.sql ("Select * from mytbl_hive where dt > date_sub(current_date, 3)").registerTempTable("tmp_mytbl")

val mytbl = sqlContext.table("tmp_mytbl")
mytbl.write.mode("overwrite").saveTableAs("tmp_mytbl")

//writing back to Hive ...

mytbl.write.mode("overwrite").insertInto("mytbl_hive")

I get the error that I cannot write to the table I'm reading from.

Does anyone know of a better way of doing this?

You cannot. As you've learned, Spark explicitly prohibits overwriting a table that is used as a source for the query. While there exist some workarounds depending on the technicalities, they are not reliable and should be avoided.

Instead:

  • Write data to a temporary table.
  • Drop the old table.
  • Rename the temporary table (a sketch of these steps follows the list).
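A minimal sketch of those three steps, reusing the table and column names from the question; the staging table name mytbl_hive_staging is made up for illustration:

// 1. Write the rows you want to keep into a separate staging table.
spark.sql("SELECT * FROM mytbl_hive WHERE dt > date_sub(current_date, 3)")
  .write
  .mode("overwrite")
  .saveAsTable("mytbl_hive_staging")

// 2. Drop the old table.
spark.sql("DROP TABLE IF EXISTS mytbl_hive")

// 3. Rename the staging table to take its place.
spark.sql("ALTER TABLE mytbl_hive_staging RENAME TO mytbl_hive")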

The Hive ETL takes the main table and deletes data that is older than 3 days using insert overwrite.

It might be a better idea to partition the data by date and just drop partitions, without even looking at the data.
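A minimal sketch of the partition-based approach, assuming mytbl_hive is partitioned by dt (stored as a yyyy-MM-dd string) and that newData / todaysData are DataFrames produced by the sync; only the table name, the dt column and the 3-day window come from the question, the rest is illustrative:

import java.time.LocalDate

// One-time (or full-reload) creation of the table, partitioned by dt.
newData.write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("mytbl_hive")

// Nightly sync appends the new day's data; the dt column decides the partition.
todaysData.write
  .mode("append")
  .insertInto("mytbl_hive")

// Nightly cleanup: drop the partition that fell out of the 3-day window,
// without scanning any data. DROP PARTITION needs an exact value in Spark SQL.
val expired = LocalDate.now().minusDays(3).toString   // e.g. "2024-01-01"
spark.sql(s"ALTER TABLE mytbl_hive DROP IF EXISTS PARTITION (dt = '$expired')")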
