
How reliable is a Spark stream join with a static Databricks Delta table?

In Databricks there is a cool feature that allows joining a streaming DataFrame with a Delta table. The cool part is that changes in the Delta table are still reflected in subsequent join results. It works just fine, but I'm curious how this works and what the limitations are. E.g., what's the expected update delay? How does it change as the Delta table grows? Is it safe to rely on it in production?
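For context, here is a minimal sketch of the pattern I mean. The paths (`/mnt/events`, `/mnt/dim_customers`) and the `customer_id` join key are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Streaming side: a Delta table read as a stream
events = spark.readStream.format("delta").load("/mnt/events")

# Static side: a plain batch read of a Delta table; in a stream-static
# join the table's latest snapshot is resolved for each micro-batch,
# which is why later updates to it show up in later join results
customers = spark.read.format("delta").load("/mnt/dim_customers")

query = (
    events.join(customers, "customer_id", "left")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/enriched")
    .start("/mnt/enriched_events")
)
```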

Yes, you can rely on this feature (it's really a Spark feature, not Databricks-specific) - many customers are using it in production. Regarding the other questions - there are multiple aspects here, depending on factors such as how often the table is updated:

  • Because the static Delta table isn't cached, it's re-read on each micro-batch, so updates to it are picked up on the next micro-batch after they commit. Depending on the cluster configuration, this re-read may not be very expensive if you use Delta caching, since unchanged files aren't re-downloaded every time; only new data is downloaded (see the first sketch after this list).
  • Read performance can also suffer if the table consists of a lot of small files; that depends on how you're writing into the table and whether you run things like OPTIMIZE (see the second sketch after this list).
  • Depending on how often the Delta table is updated, you can cache it and periodically refresh it (see the third sketch after this list).
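On the first point: the Databricks disk cache (formerly called Delta caching) can be enabled per session as sketched below. It only takes effect on worker instance types with local SSDs, and on some instance types it is on by default:

```python
# Databricks disk cache ("Delta cache"): keeps copies of the remote
# Parquet files on the workers' local SSDs, so the static table's
# unchanged files aren't re-fetched from cloud storage every micro-batch
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```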
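On the second point: a sketch of compacting the static table with OPTIMIZE (the table name and join key are illustrative). Fewer, larger files mean each micro-batch re-read touches less metadata and fewer objects:

```python
# Compact many small files into fewer large ones
spark.sql("OPTIMIZE dim_customers")

# Optionally co-locate rows on the join key so data skipping
# can prune files during the join
spark.sql("OPTIMIZE dim_customers ZORDER BY (customer_id)")
```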
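On the third point: one way to cache and periodically refresh the static side is to run the join inside foreachBatch, so it is planned fresh on every micro-batch against whichever snapshot is currently cached. This is a sketch under the same hypothetical names, with a made-up refresh interval; the trade-off is that table updates become visible only after the next refresh rather than on the next micro-batch:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

REFRESH_SECS = 600                 # hypothetical refresh interval
_snapshot = {"df": None, "loaded_at": 0.0}

def get_customers():
    """Return a cached snapshot of the static table,
    reloading it once it is older than REFRESH_SECS."""
    now = time.time()
    if _snapshot["df"] is None or now - _snapshot["loaded_at"] > REFRESH_SECS:
        if _snapshot["df"] is not None:
            _snapshot["df"].unpersist()
        df = spark.read.format("delta").load("/mnt/dim_customers")
        df.cache()
        df.count()                 # materialize the cache
        _snapshot["df"], _snapshot["loaded_at"] = df, now
    return _snapshot["df"]

def enrich(batch_df, batch_id):
    # foreachBatch hands us a plain batch DataFrame, so this join is
    # re-planned every micro-batch and sees the refreshed snapshot
    (batch_df.join(get_customers(), "customer_id", "left")
        .write.format("delta").mode("append")
        .save("/mnt/enriched_events"))

query = (
    spark.readStream.format("delta").load("/mnt/events")
    .writeStream
    .foreachBatch(enrich)
    .option("checkpointLocation", "/mnt/checkpoints/enriched")
    .start()
)
```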

But really, to answer completely, you'd need to provide more information specific to your code, use case, etc.
