
How to reinstall same version of a wheel on Databricks without cluster restart

I'm developing some Python code that will be used as entry points for various wheel-based workflows on Databricks. Since it's under development, every time I change the code and want to test it, I need to build a wheel and deploy it to a Databricks cluster to run it (I use some functionality that's only available in the Databricks runtime, so I cannot run it locally).

Here is what I do:

REMOTE_ROOT='dbfs:/user/kash@company.com/wheels'
cd /home/kash/workspaces/project
rm -rf dist

poetry build
whl_file=$(ls -1tr dist/project-*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying..'     && databricks fs cp --overwrite dist/$whl_file $REMOTE_ROOT
echo 'installing..'  && databricks libraries install --cluster-id 111-222-abcd \
                                                    --whl $REMOTE_ROOT/$whl_file
# ---- I WANT TO AVOID THIS as it takes time ----
echo 'restarting'    && databricks clusters restart --cluster-id 111-222-abcd

# Run the job that uses some modules from the wheel we deployed
echo 'running job..' && databricks jobs run-now --job-id 1234567

The problem is that every time I change even one line, I need to restart the cluster, which takes 3-4 minutes. Unless I restart the cluster, databricks libraries install does not reinstall the wheel.

I've tried bumping the wheel's version number, but then the GUI (Compute -> Select-cluster -> Libraries-tab) shows two versions of the same wheel installed on the cluster, while on the cluster itself the newer version is not actually installed (verified using ls -l .../site-packages/).
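The check described above can also be done from Python rather than by listing site-packages. This is a minimal sketch, assuming the distribution is named project (inferred from the wheel filename pattern in the script); the helper names installed_version and describe_install are hypothetical. It only reports what the Python environment running the code can see, so it would need to run on the cluster, for example in a notebook cell attached to it.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_install(dist_name: str) -> str:
    """Build a one-line summary of what this environment has installed."""
    version = installed_version(dist_name)
    if version is None:
        return f"{dist_name} is not installed in this environment"
    return f"{dist_name} {version} is installed in this environment"


if __name__ == "__main__":
    # "project" is an assumed distribution name; replace it with your own.
    print(describe_install("project"))
```

Comparing the reported version with the one in the freshly built wheel shows whether the cluster picked up the new build or is still using the old one.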

dbx by Databricks Labs would suit your requirements well.

Sure, you could look at its source code on GitHub and try to mimic the same behavior in your own code, but that would be far too much work when dbx (its execute command) already does this for you.

With it you can keep making changes to your Python code and run dbx execute --task=<the task that you define in the config while still developing in your local IDE> --cluster-name=<your all-purpose cluster name>

That takes care of building the whl, deploying it to the cluster, and starting the job for you to test, all without leaving your local IDE.

So you can keep changing your whl during development and keep testing on the same running cluster (dbx will start the cluster if it is not running) without restarting it, because the code runs in a separate context. See the screenshot below from their documentation.

The main page of dbx is here.

This specific section of it explains this functionality. [Screenshot from the dbx documentation]

I have just started using dbx and it does make these things very simple.
