How to get new/updated records from Delta table after upsert using merge?

Is there any way to get the updated/inserted rows after an upsert into a Delta table using merge in a Spark streaming job?


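import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")

// Upsert each micro-batch into the Delta table, matching rows on entityId.
def upsertToDelta(events: DataFrame, batchId: Long): Unit = {
  deltaTable.as("table")
    .merge(
      events.as("event"),
      "event.entityId == table.entityId")
    .whenMatched()
    .updateExpr(...)
    .whenNotMatched()
    .insertAll()
    .execute()
}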

df
  .writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()

You can enable Change Data Feed on the table, and then have another stream or batch job fetch the changes, so you'll be able to receive information on which rows have been changed/deleted/inserted. It can be enabled with:

ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

If the table isn't registered, you can use a path instead of the table name:

ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
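The same statements can also be issued from the job itself. Below is a minimal sketch in Python, assuming an existing SparkSession is passed in as `spark`; the helper names `enable_cdf_statement` and `enable_cdf` are hypothetical and only illustrate how the two forms above differ. Keep in mind that the feed only records changes made after the property is set.

```python
def enable_cdf_statement(target, is_path=False):
    """Build the ALTER TABLE statement that turns on Change Data Feed.

    target is a table name, or a storage path when is_path is True.
    (Hypothetical helper, not part of the Delta API.)
    """
    # Unregistered tables are addressed as delta.`path`.
    identifier = f"delta.`{target}`" if is_path else target
    return (
        f"ALTER TABLE {identifier} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def enable_cdf(spark, target, is_path=False):
    """Run the statement on a SparkSession supplied by the caller."""
    spark.sql(enable_cdf_statement(target, is_path))
```

For example, `enable_cdf(spark, "table_name")` would run the first statement shown above.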

The changes will be available if you add the .option("readChangeFeed", "true") option when reading a stream from the table:

spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("table_name")

This adds three columns to the result that describe each change. The most important is _change_type (please note that there are two different types for an update operation: update_preimage and update_postimage).
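To get back only the new and updated rows, one option is to filter the change feed on _change_type. Here is a minimal sketch in Python, assuming an existing SparkSession is passed in as `spark` and that Change Data Feed is already enabled on the table. The names `UPSERT_CHANGE_TYPES`, `upsert_filter` and `read_upserted_rows` are hypothetical.

```python
# _change_type values that represent the state of a row after an upsert:
# "insert" for new rows, "update_postimage" for the new version of an updated row.
# ("update_preimage" and "delete" are deliberately left out.)
UPSERT_CHANGE_TYPES = ("insert", "update_postimage")


def upsert_filter(change_types=UPSERT_CHANGE_TYPES):
    """Build a SQL filter expression over the _change_type column."""
    quoted = ", ".join(f"'{t}'" for t in change_types)
    return f"_change_type IN ({quoted})"


def read_upserted_rows(spark, table_name):
    """Stream the change feed of table_name, keeping inserted and updated rows."""
    changes = (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .table(table_name)
    )
    return changes.filter(upsert_filter())
```

The returned streaming DataFrame still carries the change columns, so a downstream consumer can tell inserts from updates if it needs to.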

If you're worried about having another stream, it's not a problem: you can run multiple streams inside the same job. You just shouldn't use .awaitTermination, but something like spark.streams.awaitAnyTermination() to wait on multiple streams.
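Running both streams from one job could look roughly like this. This is a sketch only, assuming each stream is wrapped in a function that takes the SparkSession and returns the started query; `start_streams_and_wait` and the `start_fns` argument are hypothetical names.

```python
def start_streams_and_wait(spark, start_fns):
    """Start every stream, then block until any one of them terminates.

    start_fns is a list of callables; each takes the SparkSession, calls
    .start() on its own writeStream, and returns the resulting query.
    """
    queries = [start(spark) for start in start_fns]
    # Wait on all active queries at once instead of calling
    # .awaitTermination() on a single query.
    spark.streams.awaitAnyTermination()
    return queries
```

Here one callable would start the merge stream from the question, and a second would start the stream that reads the change feed.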

P.S. This answer may change if you explain why you need to get the changes inside the same job.
