
Running python package .egg in Azure Databricks Job

I packaged my Python code in .egg format using a build tool (setuptools). I want to run this package as a job in Azure Databricks.

I am able to execute the package on my local machine with the command below:

spark-submit --py-files ./dist/hello-1.0-py3.6.egg hello/pi.py
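For context, the question does not show the project layout; a minimal sketch that would match the egg name and the spark-submit call above (all file contents here are illustrative assumptions, not the asker's code) could be:

# Project layout (assumed):
#   setup.py
#   hello/__init__.py
#   hello/pi.py
#
# setup.py -- running `python setup.py bdist_egg` with Python 3.6
# produces dist/hello-1.0-py3.6.egg
from setuptools import setup, find_packages

setup(
    name="hello",
    version="1.0",
    packages=find_packages(),
)

# hello/pi.py -- a typical Spark entry point (here: a Monte Carlo pi estimate)
from operator import add
from random import random

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    n = 100000

    def inside(_):
        x, y = random(), random()
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(n)).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()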

1) Copied the package to a DBFS path as follows:

Workspace -> User -> Create -> Library -> Library Source (DBFS) -> Library Type (Python Egg) -> Uploaded

2) Created a job with a spark-submit task in new cluster mode

3) Configured the following parameters for the task:

["--py-files","dbfs:/FileStore/jars/8c1231610de06d96-hello_1_0_py3_6-70b16.egg","hello/pi.py"]

Actual: /databricks/python/bin/python: can't open file '/databricks/driver/hello/hello.py': [Errno 2] No such file or directory

Expected: The job should execute successfully.
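For reference, the UI configuration in steps 2 and 3 corresponds to a spark_submit_task job in the Jobs API 2.0; a minimal sketch of that payload (the cluster spec values are placeholders, not taken from the question):

# Sketch of the equivalent Jobs API 2.0 payload, POSTed to /api/2.0/jobs/create.
# spark_version / node_type_id / num_workers are placeholder values.
job_spec = {
    "name": "hello-egg-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
    "spark_submit_task": {
        "parameters": [
            "--py-files",
            "dbfs:/FileStore/jars/8c1231610de06d96-hello_1_0_py3_6-70b16.egg",
            "hello/pi.py",
        ]
    },
}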

The only way I've got this to work is by using the API to create a Python job. The UI does not support this for some reason.

I use PowerShell to work with the API - this is an example that creates a job using an egg, which works for me:

$Lib = '{"egg":"LOCATION"}'.Replace("LOCATION", "dbfs:$TargetDBFSFolderCode/pipelines.egg")
$ClusterId = "my-cluster-id"
$j = "sample"
$PythonParameters = "pipelines.jobs.cleansed.$j"
$MainScript = "dbfs:" + $TargetDBFSFolderCode + "/main.py"
Add-DatabricksDBFSFile -BearerToken $BearerToken -Region $Region -LocalRootFolder "./bin/tmp" -FilePattern "*.*"  -TargetLocation $TargetDBFSFolderCode -Verbose
Add-DatabricksPythonJob -BearerToken $BearerToken -Region $Region -JobName "$j-$Environment" -ClusterId $ClusterId `
    -PythonPath $MainScript -PythonParameters $PythonParameters -Libraries $Lib -Verbose

That copies my main.py and pipelines.egg to DBFS, then creates a job pointed at them, passing in a parameter.
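For anyone not on PowerShell: those cmdlets just wrap the Databricks REST API, and the job they create corresponds roughly to a spark_python_task payload like the one below (a sketch; the host and token environment variables, cluster id, and DBFS paths are placeholders):

# Sketch: create the same kind of Python job directly with the Jobs API 2.0 -
# main script on DBFS, egg attached as a library, run on an existing cluster.
import os

import requests

payload = {
    "name": "sample-dev",
    "existing_cluster_id": "my-cluster-id",
    "libraries": [{"egg": "dbfs:/pipelines/code/pipelines.egg"}],
    "spark_python_task": {
        "python_file": "dbfs:/pipelines/code/main.py",
        "parameters": ["pipelines.jobs.cleansed.sample"],
    },
}

resp = requests.post(
    os.environ["DATABRICKS_HOST"] + "/api/2.0/jobs/create",
    headers={"Authorization": "Bearer " + os.environ["DATABRICKS_TOKEN"]},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["job_id"])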

One annoying thing about eggs on Databricks: you must uninstall the egg and restart the cluster before it picks up any new version that you deploy.

If you use an engineering cluster, this is not an issue.
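On an interactive cluster, that uninstall/restart/reinstall cycle can be scripted against the Libraries and Clusters APIs; a minimal sketch (host, token, cluster id, and egg path are placeholders):

# Sketch: detach the old egg, restart the cluster so the uninstall takes
# effect, then attach the new build. All identifiers are placeholders.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": "Bearer " + os.environ["DATABRICKS_TOKEN"]}
CLUSTER_ID = "my-cluster-id"
EGG = {"egg": "dbfs:/pipelines/code/pipelines.egg"}

def post(path, body):
    r = requests.post(HOST + path, headers=HEADERS, json=body)
    r.raise_for_status()
    return r

post("/api/2.0/libraries/uninstall", {"cluster_id": CLUSTER_ID, "libraries": [EGG]})
post("/api/2.0/clusters/restart", {"cluster_id": CLUSTER_ID})
post("/api/2.0/libraries/install", {"cluster_id": CLUSTER_ID, "libraries": [EGG]})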
