
Package Conda Environment with Azure Synapse Spark Jobs

I am trying to use Azure Synapse to run Spark jobs. According to the documentation, Synapse allows installing third-party libraries via workspace packages.

However, the Azure Synapse command reference shows that it supports an archives option. Does that mean I can reference a packaged conda environment in Blob Storage and have all the necessary packages available on all the nodes?

To test this, I created a small conda environment with a single external package, humanize:


  conda create -y -n hornet_conda_env
  conda activate hornet_conda_env
  conda install -c conda-forge humanize
  conda pack -f -o hornet_conda_env.tar.gz
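
Before uploading, it may be worth confirming that the packed archive really contains the package. The following is a minimal sketch using Python's standard tarfile module. The helper name archive_contains_package is hypothetical, and it assumes the archive keeps packages under a site-packages directory.

```python
import tarfile


def archive_contains_package(archive_path, package):
    """Return True if any archive entry lives under site-packages/<package>/."""
    marker = f"site-packages/{package}/"
    with tarfile.open(archive_path, "r:gz") as tar:
        return any(marker in name for name in tar.getnames())


# Example call (assumes the archive from the commands above is in the
# current directory):
# archive_contains_package("hornet_conda_env.tar.gz", "humanize")
```

If this returns False, the problem is in the packing step rather than in how Spark or Synapse distributes the archive.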

The entry Spark job file, script.py:

  import humanize
  import datetime as dt

  print(humanize.naturalday(dt.datetime.now()))

The script and conda environment package ( hornet_conda_env.tar.gz ) are uploaded to Azure Blob Storage and a spark job is created with the reference to the script.

The spark job definition is invoked via command line using az cli as follows:

az synapse spark job submit \
--workspace-name <workspace_name> \
--spark-pool-name <pool_name> \
--executor-size Small \
--executors 2 \
--language PySpark \
--main-definition-file abfss://<full_path_for_entry_spark_job>.script.py \
--name <name> \
--archives abfss://<full_path_for_conda_package>/hornet_conda_env.tar.gz

The script fails with ModuleNotFoundError: No module named 'humanize', meaning the conda environment referenced in archives was not made available to the job.

Does synapse allow this kind of conda environment packaging and distribution?

Update 1

I also tried to use this packaged environment locally, from outside the conda environment, and I get the same error: ModuleNotFoundError: No module named 'humanize'

PYSPARK_PYTHON=/Users/<username>/opt/anaconda3/bin/python3 \
spark-submit \
--master "local[3]" \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/Users/<username>/opt/anaconda3/bin/python3 \
--archives ~/workspace/cuezen/hornet/hornet_env.tar.gz#environment  \
~/workspace/cuezen/hornet/hornet.py
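
For comparison, the Spark documentation on Python package management points PYSPARK_PYTHON at the interpreter inside the unpacked archive (./environment/bin/python when the archive alias is #environment), whereas the command above points it at the base Anaconda interpreter. The following is a sketch of that style of invocation, built from Python. The helper name build_submit and its parameters are hypothetical. It has not been verified against Synapse, and whether it resolves the error in this setup is an assumption.

```python
import os


def build_submit(archive, script, alias="environment",
                 master="local[3]", driver_python="python"):
    """Build the environment and argument list for a spark-submit call.

    PYSPARK_PYTHON is relative to the directory where Spark unpacks the
    archive, so workers use the packed interpreter. The driver uses
    driver_python, which must itself be able to import the packages.
    """
    env = dict(os.environ)
    env["PYSPARK_PYTHON"] = f"./{alias}/bin/python"
    env["PYSPARK_DRIVER_PYTHON"] = driver_python
    args = [
        "spark-submit",
        "--master", master,
        "--archives", f"{archive}#{alias}",
        script,
    ]
    return env, args


# Usage (requires spark-submit on PATH):
# import subprocess
# env, args = build_submit("hornet_env.tar.gz", "hornet.py")
# subprocess.run(args, env=env, check=True)
```

Note that a top-level import humanize in the script runs on the driver, so the driver interpreter also needs the package when the driver runs outside the unpacked archive.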

Follow the steps below to get rid of the module-not-found error:

  1. Enable a virtual environment in your terminal. A few commands:

     python -m venv .venv
     .venv\Scripts\activate
     # add the module to requirements.txt, then install it
     pip install -r requirements.txt
  2. Make sure this module is present in site-packages.

  3. If it is not present, add it explicitly, comparing against the azure-python-sdk or humanize module.
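
One way to carry out the site-packages check from the steps above is to ask the running interpreter directly. This is a minimal sketch using only the standard library. The helper name module_report is hypothetical.

```python
import importlib.util
import sys


def module_report(name):
    """Describe which interpreter is running and where `name` resolves from."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(module_report("humanize"))
```

Running this with the same interpreter that Spark uses shows whether the module is missing from that interpreter's site-packages or a different interpreter is being picked up.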
