Azure Databricks cluster init script - install python wheel
install python packages using init scripts in a databricks cluster
I installed the Databricks CLI tool by running:
pip install databricks-cli
(If you are using Python 3, run pip3 instead.)
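For reference, a minimal sketch of wiring the CLI up to that token (the workspace URL and PAT are supplied interactively when prompted; the `dbfs ls` call is only a smoke test of the configuration):

```shell
# Configure the legacy Databricks CLI with a personal access token.
# You will be prompted for the workspace URL and the PAT.
databricks configure --token

# Smoke-test the configuration by listing the DBFS root.
dbfs ls dbfs:/
```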
Then, after creating a PAT (a personal access token in Databricks), I ran the following .sh bash script:
#!/bin/bash
# You can run this on Windows as well, just change it to a batch file.
# Note: you need the Databricks CLI installed and a token configured.
echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages
echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh
echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages
The python_dependencies.sh script:
#!/bin/bash
# Restart cluster after running.
sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The last apt-get command exit status : ${status}"
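As an aside, the `$?` check at the end only reflects the very last command. A minimal, standalone illustration of how `$?` captures the previous command's exit status (the `|| status=$?` pattern also keeps a script alive under `set -e`):

```shell
#!/bin/bash
# grep finds no match in the empty /dev/null, so it exits with status 1.
# "|| status=$?" records that status without aborting the script.
status=0
grep -q anything /dev/null || status=$?
echo "grep exit status: ${status}"
```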
I use the script above in the cluster's init-scripts to install the Python libraries.
My problem is that even though everything seems fine and the cluster starts successfully, the libraries are not installed correctly. When I click on the cluster's Libraries tab, I see the following:
Thanks for your help and comments.
I found the solution based on @RedCricket's comment:
#!/bin/bash
pip install applicationinsights==0.11.9
pip install azure-servicebus==0.50.2
pip install azure-storage-file-datalake==12.0.0
pip install humanfriendly==8.2
pip install mlflow==1.8.0
pip install numpy==1.18.3
pip install opencensus-ext-azure==1.0.2
pip install packaging==20.4
pip install pandas==1.0.3
pip install --upgrade scikit-learn==0.22.2.post1
The .sh file above will install all the referenced Python dependencies when the cluster starts, so the libraries do not have to be reinstalled every time a notebook is re-executed.
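A variant of the same fix, sketched under the assumption that `pip` on the cluster node targets the cluster's Python environment: pinning everything in a single `pip install` call makes the init script fail fast if any one package cannot be resolved, instead of silently skipping some packages.

```shell
#!/bin/bash
# Single pip invocation: either the whole pinned set installs, or the
# script exits non-zero and the cluster surfaces the init-script failure.
set -e
pip install \
  applicationinsights==0.11.9 \
  azure-servicebus==0.50.2 \
  azure-storage-file-datalake==12.0.0 \
  humanfriendly==8.2 \
  mlflow==1.8.0 \
  numpy==1.18.3 \
  opencensus-ext-azure==1.0.2 \
  packaging==20.4 \
  pandas==1.0.3 \
  scikit-learn==0.22.2.post1
```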
According to the documentation for Azure Databricks:
https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/
# Set up authentication using an Azure AD token
export DATABRICKS_AAD_TOKEN=$(jq .accessToken -r <<< "$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)")
# Databricks CLI configuration
databricks configure --host "https://<databricks-instance>" --aad-token
Now, copy the script file to the Databricks file system:
databricks fs cp "./cluster-scoped-init-scripts/db_scope_init_script.sh" "dbfs:/databricks/init-scripts/db_scope_init_script.sh"
Make sure the db_scope_init_script.sh shell script contains the required installation commands.
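For example, db_scope_init_script.sh could mirror the pip-based approach from the accepted fix (the package pins here are just the ones used earlier in this post; adjust as needed):

```shell
#!/bin/bash
# Cluster-scoped init script: install pinned Python dependencies with pip.
pip install mlflow==1.8.0 numpy==1.18.3 pandas==1.0.3
```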
Finally, configure the cluster-scoped init script using the Clusters REST API:
curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/init-scripts/db_scope_init_script.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit
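To confirm that the edit took effect, the cluster definition can be read back with the Clusters `get` endpoint; the cluster id below is the same one used in the edit call above:

```shell
# Read back the cluster spec and check that init_scripts now lists the script.
curl -n -X GET \
  "https://<databricks-instance>/api/2.0/clusters/get?cluster_id=1202-211320-brick1"
```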