
In Databricks, how to automate notebook runs

I have multiple datasets in Databricks that are updated on inconsistent schedules: database.A, database.B, and database.C.

  • database.A : is updated on the first of every month (i.e. 1/1/2022, 2/1/2022, etc.), but sometimes has mid-month updates (i.e. 3/14/2022, 4/12/2022, etc.)
  • database.B : is updated on the fifth of every month
  • database.C : is updated on the first of every quarter (i.e. 1/1/2022, 4/1/2022, etc.), but sometimes has a mid-quarter update (i.e. 5/1/2022, etc.)

My goal is to create a notebook that runs processes when the data is updated in any of these datasets. For example:

data.updated.A <- some_code_or_function(database.A)
data.updated.B <- some_code_or_function(database.B)
data.updated.C <- some_code_or_function(database.C)

case when data.updated.A = TRUE or data.updated.B = TRUE or data.updated.C = TRUE
  then run_notebook
  else do_nothing_and_send_signal_1_day_from_now
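The check-then-run logic above can be sketched in Python (a common choice for Databricks notebooks). The catalog dictionary and the `last_update` helper here are stand-ins: on Databricks you would typically obtain a table's last modification time by querying it, e.g. `DESCRIBE HISTORY database.A` for a Delta table.

```python
from datetime import datetime

def last_update(table: str, catalog: dict) -> datetime:
    # Stand-in lookup. In a real notebook you would query the table's
    # last commit time instead, e.g. via DESCRIBE HISTORY on a Delta table.
    return catalog[table]

def any_table_updated(tables, catalog, last_run: datetime) -> bool:
    """Return True if any monitored table changed since the last run."""
    return any(last_update(t, catalog) > last_run for t in tables)

# Example data mirroring the update schedules described above.
catalog = {
    "database.A": datetime(2022, 3, 14),  # mid-month update
    "database.B": datetime(2022, 3, 5),
    "database.C": datetime(2022, 1, 1),
}
last_run = datetime(2022, 3, 10)

should_run = any_table_updated(
    ["database.A", "database.B", "database.C"], catalog, last_run
)
# database.A changed on 3/14, after the last run, so should_run is True;
# the notebook body would go in the branch guarded by this flag.
```

The same pattern works from SQL or R; the key idea is comparing each table's last-update timestamp against the timestamp of the previous successful run.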

Any ideas? Full disclosure: I am relatively new to Databricks, so I may not know whether I need to switch from SQL to Scala, Python, or R, and am fully willing to. Should I consider another tactic besides scheduled processes? Thanks.

You can run the notebook as a job and schedule it with a cron expression: https://docs.databricks.com/jobs.html#create-a-job
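As a sketch of what such a scheduled job looks like, here is a Python dict shaped like a Jobs API create-job request with a cron schedule. The notebook path and job name are placeholders, and the exact field names should be verified against the Jobs API docs linked above for your workspace version; Databricks schedules use Quartz cron syntax.

```python
import json

# Sketch of a create-job payload that runs a notebook on a cron schedule.
# Field names follow the Databricks Jobs API; verify against the docs.
job_payload = {
    "name": "refresh-on-data-update",          # placeholder job name
    "tasks": [
        {
            "task_key": "run_refresh_notebook",
            "notebook_task": {
                # Hypothetical notebook path
                "notebook_path": "/Users/me/refresh_notebook",
            },
        }
    ],
    # Quartz cron fields: sec min hour day-of-month month day-of-week.
    # This example fires at 06:00 on the 5th of every month.
    "schedule": {
        "quartz_cron_expression": "0 0 6 5 * ?",
        "timezone_id": "UTC",
    },
}

payload_json = json.dumps(job_payload)
```

Because the question's tables update on irregular dates, a common compromise is to schedule the job daily (e.g. `0 0 6 * * ?`) and let the notebook itself check whether any table has new data before doing real work.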

If you are deploying your notebooks using Terraform, you can use this module that I wrote: https://github.com/tomarv2/terraform-databricks-workspace-management

