
Install python packages using init scripts in a Databricks cluster

I have installed the databricks cli tool by running the following command:

pip install databricks-cli

using the appropriate version of pip for your Python installation. If you are using Python 3, run pip3.
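Before the upload script below can talk to the workspace, the CLI also needs to be pointed at the workspace and authenticated with the PAT. A minimal sketch, assuming the legacy databricks-cli (the host URL and token values are placeholders):

# Option 1: interactive prompt for host and token
databricks configure --token

# Option 2: write the config file the legacy CLI reads (~/.databrickscfg) directly
cat > ~/.databrickscfg <<EOF
[DEFAULT]
host = https://<databricks-instance>
token = <your-personal-access-token>
EOF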

Then, after creating a PAT (personal access token in Databricks), I run the following .sh bash script:

#!/bin/bash
# You can run this on Windows as well, just change to a batch file
# Note: You need the Databricks CLI installed and you need a token configured
echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages

echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh

echo "Listing DBFS direcrtory"
dbfs ls dbfs:/databricks/packages

python_dependencies.sh script:

#!/bin/bash
# Restart cluster after running.

sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The date command exit status : ${status}"

I use the above script to install python libraries in the init scripts of the cluster.

[screenshot of the cluster's init-scripts configuration]

My problem is that even though everything seems to be fine and the cluster starts successfully, the libraries are not installed properly. When I click on the Libraries tab of the cluster I get this:

[screenshot of the cluster's Libraries tab] Only 1 out of the 10 python libraries is installed.

Appreciate your help and comments.

I have found the solution based on the comment of @RedCricket:

#!/bin/bash

pip install applicationinsights==0.11.9
pip install azure-servicebus==0.50.2
pip install azure-storage-file-datalake==12.0.0
pip install humanfriendly==8.2
pip install mlflow==1.8.0
pip install numpy==1.18.3
pip install opencensus-ext-azure==1.0.2
pip install packaging==20.4
pip install pandas==1.0.3
pip install --upgrade scikit-learn==0.22.2.post1

The above .sh file will install all the referenced python dependencies when the cluster is starting. So, the libraries won't have to be re-installed when the notebook is re-executed.
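One caveat worth noting: depending on the runtime, plain pip inside an init script may resolve to the system Python rather than the cluster's notebook environment. A hedged variant of the same script that targets the Databricks Python explicitly (the /databricks/python/bin/pip path is an assumption based on standard Databricks runtimes; adjust if yours differs):

#!/bin/bash
# Install into the cluster's notebook Python environment
# (path assumed for standard Databricks runtimes)
/databricks/python/bin/pip install \
  applicationinsights==0.11.9 \
  azure-servicebus==0.50.2 \
  azure-storage-file-datalake==12.0.0 \
  humanfriendly==8.2 \
  mlflow==1.8.0 \
  numpy==1.18.3 \
  opencensus-ext-azure==1.0.2 \
  packaging==20.4 \
  pandas==1.0.3 \
  scikit-learn==0.22.2.post1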

For Azure Databricks, as per the documentation:

https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/

# Set up authentication using an Azure AD token
export DATABRICKS_AAD_TOKEN=$(jq .accessToken -r <<< "$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)")
# Databricks CLI configuration 
databricks configure --host "https://<databricks-instance>" --aad-token
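A quick, hedged check that the CLI is actually authenticated before uploading anything (assuming the legacy databricks-cli command set):

# Should list the workspace's clusters if the AAD token was accepted
databricks clusters list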

Now, copy the script file to the Databricks file system:

databricks fs cp "./cluster-scoped-init-scripts/db_scope_init_script.sh" "dbfs:/databricks/init-scripts/db_scope_init_script.sh"

Make sure the "db_scope_init_script.sh" shell script has the required installation commands.

Finally, configure a cluster-scoped init script using the REST API:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs" : {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/db_scope_init_script.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit
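Since the request above sets cluster_log_conf to dbfs:/cluster-logs, one hedged way to verify that the init script actually ran is to look for its output under that destination after the cluster (re)starts; the cluster ID below is the one used in the example request:

# Init-script stdout/stderr are typically delivered under
# <log destination>/<cluster-id>/init_scripts/ when cluster logging is enabled
dbfs ls dbfs:/cluster-logs/1202-211320-brick1/init_scripts/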
