
spark - scala - save dataframe to a table with overwrite mode

I would like to know what exactly "overwrite" does here. Let's say I have a table "tb1" with the following records (sorry for the poor table formatting):

driver   vin   make        model
martin   abc   ford        escape
john     abd   toyota      camry
amy      abe   chevrolet   malibu
carlos   abf   honda       civic

Now I have the following dataframe (mydf) with the same columns but with the following rows/data:

martin   abf   toyota   corolla
carlos   abg   nissan   versa

After saving the above dataframe to "tb1" with overwrite mode, will it entirely delete the contents of "tb1" and write only the data of mydf (the two records above)?

However, I would like the overwrite mode to overwrite only those rows that have the same value in the "driver" column. In that case, of the 4 records in "tb1", mydf would overwrite only the 2 matching records, and the resulting table would be as follows:

driver   vin   make        model
martin   abf   toyota      corolla
john     abd   toyota      camry
amy      abe   chevrolet   malibu
carlos   abg   nissan      versa

Can I achieve this functionality using overwrite mode?

mydf.write.mode(SaveMode.Overwrite).saveAsTable("tb1")

What you want is to merge two dataframes on the primary key: replace the old rows with the new rows, and append any extra rows that are present.

This can't be achieved with SaveMode.Overwrite or SaveMode.Append.

To do this, you need to implement the merge of the two dataframes yourself, keyed on the primary key.

Something like this:

 val parentDF = ...   // actual dataframe
 val deltaDF  = ...   // new delta to be merged

 // Register both dataframes as temp views so they can be referenced in SQL
 parentDF.createOrReplaceTempView("parentDF")
 deltaDF.createOrReplaceTempView("deltaDF")

 // Rows of parentDF that have a match in deltaDF, i.e. the stale rows
 val updateDF = spark.sql("select parentDF.* from parentDF join deltaDF on parentDF.id = deltaDF.id")

 // Drop the stale rows, then union in the fresh delta rows
 val totalDF = parentDF.except(updateDF).union(deltaDF)
 totalDF.write.mode(SaveMode.Overwrite).saveAsTable("tb1")
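The merge semantics above (replace matching rows, keep the rest, append the new ones) can be sketched with plain Scala collections, independent of Spark. The `Row` case class and `merge` helper below are illustrative names, not part of any API:

```scala
// A row of "tb1", keyed by its primary key, "driver".
case class Row(driver: String, vin: String, make: String, model: String)

// Keep parent rows whose key has no match in the delta, then append the delta.
// This mirrors parentDF.except(updateDF).union(deltaDF) from the Spark snippet.
def merge(parent: Seq[Row], delta: Seq[Row]): Seq[Row] = {
  val deltaKeys = delta.map(_.driver).toSet
  parent.filterNot(r => deltaKeys.contains(r.driver)) ++ delta
}

val parent = Seq(
  Row("martin", "abc", "ford",      "escape"),
  Row("john",   "abd", "toyota",    "camry"),
  Row("amy",    "abe", "chevrolet", "malibu"),
  Row("carlos", "abf", "honda",     "civic")
)
val delta = Seq(
  Row("martin", "abf", "toyota", "corolla"),
  Row("carlos", "abg", "nissan", "versa")
)

val merged = merge(parent, delta)
```

With the question's data, `merged` contains the four rows of the desired result table: john and amy untouched, martin and carlos replaced.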

Answering your question:

Can I achieve this functionality using overwrite mode?

No, you can't.

What Overwrite does, in practice, is delete the whole table you want to populate and recreate it with the contents of the new DataFrame you give it.
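To illustrate, here is a minimal plain-Scala model of that behavior (a real table would of course live in the Hive metastore; this is only a sketch of the semantics):

```scala
// Model the table as a mutable reference to its full set of rows.
var table: Seq[(String, String, String, String)] = Seq(
  ("martin", "abc", "ford",   "escape"),
  ("john",   "abd", "toyota", "camry")
)

// Overwrite semantics: the previous contents are discarded wholesale and
// replaced by the new rows -- there is no row-level matching on any key.
def overwrite(newRows: Seq[(String, String, String, String)]): Unit =
  table = newRows

overwrite(Seq(("martin", "abf", "toyota", "corolla")))
```

After the call, the john row is gone even though the new data never mentioned it, which is exactly why Overwrite alone cannot do a keyed update.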

To get the result you want, do the following:

  • Read the information of the table you want to "update" into a new DataFrame:

    val dfTable = hiveContext.read.table("table_tb1")

  • Do a left join between the DataFrame of the table to update (dfTable) and the DataFrame with your new information (mydf), joining on your "PK", which in your case is the driver column.

In the same statement, you keep only the records where the mydf("driver") column is null; those are the table rows with no match in mydf, i.e. the ones that need no update.

val newDf = dfTable.join(mydf, dfTable("driver") === mydf("driver"), "leftouter" ).filter(mydf("driver").isNull)
  • After that, truncate your table tb1 and insert both DataFrames, newDf and mydf:


newDf.write.mode(SaveMode.Append).insertInto("table_tb1")  /** Info with no changes */
mydf.write.mode(SaveMode.Append).insertInto("table_tb1")   /** Info updated */

In that way, you get the result you are looking for.
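The left-outer join plus isNull filter in the second step is effectively an anti-join: it keeps only the table rows with no counterpart in mydf. Its effect can be sketched with plain Scala collections (illustrative names, no Spark required):

```scala
// A row of the table, keyed by "driver".
case class R(driver: String, vin: String, make: String, model: String)

val dfTable = Seq(
  R("martin", "abc", "ford",      "escape"),
  R("john",   "abd", "toyota",    "camry"),
  R("amy",    "abe", "chevrolet", "malibu"),
  R("carlos", "abf", "honda",     "civic")
)
val mydf = Seq(
  R("martin", "abf", "toyota", "corolla"),
  R("carlos", "abg", "nissan", "versa")
)

// Anti-join: keep only table rows whose driver has no match in mydf,
// i.e. the rows that need no update (john and amy).
val mydfDrivers = mydf.map(_.driver).toSet
val newDf = dfTable.filterNot(r => mydfDrivers.contains(r.driver))
```

Appending `newDf` and `mydf` into the truncated table then reproduces the desired result. (In Spark itself, the same anti-join can also be written directly as dfTable.join(mydf, Seq("driver"), "left_anti").)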

Regards.
