
submit pyspark job with virtual environment using livy to AWS EMR

I have created an EMR cluster with the configuration below, following the AWS documentation:

https://aws.amazon.com/premiumsupport/knowledge-center/emr-pyspark-python-3x/

{
    "Classification": "livy-conf",
    "Properties": {
      "livy.spark.deploy-mode": "cluster",
      "livy.impersonation.enabled": "true",
      "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/python3"
    }
  },

When I submit the pyspark job using livy with the following POST request:

```
payload = {
    'file': self.py_file,
    'pyFiles': self.py_files,
    'name': self.job_name,
    'archives': ['s3://test.test.bucket/venv.zip#venv', 's3://test.test.bucket/requirements.pip'],
    'proxyUser': 'hadoop',
    "conf": {
      "PYSPARK_PYTHON": "./venv/bin/python",
      "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
      "spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
      "spark.yarn.appMasterEnv.VIRTUAL_ENV": "./venv/bin/python",
      "spark.yarn.executorEnv.VIRTUAL_ENV": "./venv/bin/python",
      "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
      "spark.pyspark.virtualenv.enabled": "true",
      "spark.pyspark.virtualenv.type": "native",
      "spark.pyspark.virtualenv.requirements": "s3://test.test.bucket/requirements.pip",
      "spark.pyspark.virtualenv.path": "./venv/bin/python"
     }
}

```
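For reference, the request above can be sketched with only the standard library. The Livy endpoint URL and the `file`/`name` values below are placeholders, not taken from the question; the `conf` keys are the Spark-side ones from the payload:

```python
import json
from urllib import request

# Hypothetical Livy endpoint on the EMR master node; adjust host and port.
LIVY_BATCHES_URL = "http://emr-master:8998/batches"

# Batch payload mirroring the one above (placeholder file/pyFiles/name).
payload = {
    "file": "s3://test.test.bucket/main.py",
    "pyFiles": [],
    "name": "venv-job",
    "archives": ["s3://test.test.bucket/venv.zip#venv"],
    "proxyUser": "hadoop",
    "conf": {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
        "spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
    },
}

def submit_batch(url, body):
    """POST the payload to Livy's /batches endpoint as JSON."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Returns an HTTPResponse; Livy answers 201 Created on success.
    return request.urlopen(req)

# resp = submit_batch(LIVY_BATCHES_URL, payload)  # requires a live cluster
```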

I get the following error message:

```
Log Type: stdout
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007efc72b57740 (most recent call first)
```

I also tried setting PYTHONHOME and PYTHONPATH to the parent folder of the python binary in the virtual environment, but nothing works.

```
"spark.yarn.appMasterEnv.PYTHONPATH": "./venv/bin/",
"spark.yarn.executorEnv.PYTHONPATH": "./venv/bin/",
"livy.spark.yarn.appMasterEnv.PYTHONPATH": "./venv/bin/",
"livy.spark.yarn.executorEnv.PYTHONPATH": "./venv/bin/",
#
"spark.yarn.appMasterEnv.PYTHONHOME": "./venv/bin/",
"spark.yarn.executorEnv.PYTHONHOME": "./venv/bin/",
"livy.spark.yarn.appMasterEnv.PYTHONHOME": "./venv/bin/",
"livy.spark.yarn.executorEnv.PYTHONHOME": "./venv/bin/",
```

Error:

```
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007f7351d53740 (most recent call first):
```

This is how I created the virtual environment:

```
python3 -m venv venv/
source venv/bin/activate
python3 -m pip install -r requirements.pip
deactivate
pushd venv/
zip -rq ../venv.zip *
popd
```
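A plausible cause of the "No module named 'encodings'" error is that a venv created this way is not relocatable: its `python` is a thin launcher that locates the base interpreter via the `home` path recorded in `pyvenv.cfg`, and if that absolute path does not exist on the YARN nodes, the interpreter cannot find its standard library. A minimal sketch of inspecting that field (the cfg text is inlined as a sample; a real check would read `venv/pyvenv.cfg` from the zip):

```python
# Sample contents of a venv's pyvenv.cfg (values are illustrative).
SAMPLE_PYVENV_CFG = """\
home = /usr/bin
include-system-site-packages = false
version = 3.5.2
"""

def parse_pyvenv_cfg(text):
    """Parse pyvenv.cfg's simple 'key = value' lines into a dict."""
    cfg = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            cfg[key.strip()] = value.strip()
    return cfg

cfg = parse_pyvenv_cfg(SAMPLE_PYVENV_CFG)
# 'home' is the base-interpreter directory the venv expects to find on
# every node that unpacks the archive.
print(cfg["home"])
```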

Virtual environment structure:

```
drwxrwxr-x  2   4096 Oct 15 12:37 bin/
drwxrwxr-x  2   4096 Oct 15 12:37 include/
drwxrwxr-x  3   4096 Oct 15 12:37 lib/
lrwxrwxrwx  1      3 Oct 15 12:37 lib64 -> lib/
-rw-rw-r--  1     59 Oct 15 12:37 pip-selfcheck.json
-rw-rw-r--  1     69 Oct 15 12:37 pyvenv.cfg
drwxrwxr-x  3   4096 Oct 15 12:37 share/
```

The bin directory:

```
activate  activate.csh  activate.fish  chardetect  easy_install  easy_install-3.5  pip  pip3  pip3.5  python  python3
```

The lib directory:

```
python3.5/site-packages/
```

AWS Support says it's an ongoing bug:

https://issues.apache.org/jira/browse/SPARK-13587

https://issues.apache.org/jira/browse/ZEPPELIN-2233

Any suggestions?

Thanks!

I needed to submit a PySpark job with a virtual environment. To use a virtualenv on EMR with the 5.x distribution, I did this:

Go to the root of your code folder (for example: /home/hadoop) and run:

```
virtualenv -p /usr/bin/python3 <your-venv_name>
source <your-venv_name>/bin/activate
```

Go into <your-venv_name>/bin and run:

```
./pip3 freeze    # ensure that it is empty
sudo ./pip3 install -r <CODE FOLDER PATH>/requirements.txt
./pip3 freeze    # ensure that it is populated
```

To submit my job I used (with basic config) this command:

```
spark-submit \
  --conf spark.pyspark.virtualenv.bin.path=<path-to-your-venv_name> \
  --conf spark.pyspark.python=<path-to-your-venv_name>/bin/python3 \
  --conf spark.pyspark.driver.python=<path-to-your-venv_name>/bin/python3 \
  <path-to-your-main.py>
```

In the main.py code you also have to set "PYSPARK_PYTHON" in the environment:

```
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
```
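In context, a minimal main.py might look like the sketch below. The app name is hypothetical; the important detail is that the environment variable is set before any SparkSession is created, since Spark reads PYSPARK_PYTHON when it launches Python workers:

```python
import os

# Point Spark's Python workers at the system python3; this must happen
# before a SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

def main():
    # Imported here so the environment variable above is already set
    # (requires pyspark to be available on the driver).
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("venv-job").getOrCreate()
    spark.range(10).show()
    spark.stop()

# Entry point when run under spark-submit (left commented so the sketch
# can be imported without a Spark installation):
# main()
```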
