Missing Python dependency when submitting pyspark job to EMR using Airflow

We're using a bootstrap script for installing Python libraries on the EMR cluster nodes for our Spark jobs. The script looks something like this:

sudo python3 -m pip install pandas==0.22.0 scikit-learn==0.21.0

Once the cluster is up, we use Airflow's SparkSubmitHook to submit jobs to EMR. We use this configuration to bind pyspark to python3 (see the sketch after the stack trace below). The problem is that once in a while, when the job starts running, we get a ModuleNotFoundError: No module named 'sklearn' error. One such stack trace is shown below:

return self.loads(obj)
 File "/mnt1/yarn/usercache/root/appcache/application_1565624418111_0001/container_1565624418111_0001_01_000033/pyspark.zip/pyspark/serializers.py", line 577, in loads
   return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'sklearn'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
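
For context, the python3 binding mentioned above is done through the spark-submit configuration passed to the hook. The exact configuration isn't reproduced in the post, so the following is only a minimal sketch under assumptions: an Airflow 1.10-style SparkSubmitHook, YARN client mode, and placeholder values for the connection id, application path, and app name.

from airflow.contrib.hooks.spark_submit_hook import SparkSubmitHook

# Point both the YARN application master and the executors at python3,
# so the workers unpickle with the same interpreter the packages were
# installed into by the bootstrap script.
spark_conf = {
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/python3",
    "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3",
}

hook = SparkSubmitHook(
    conf=spark_conf,
    conn_id="spark_default",                          # placeholder connection id
    env_vars={"PYSPARK_PYTHON": "/usr/bin/python3"},
    name="example_pyspark_job",                       # placeholder app name
    verbose=True,
)

# Placeholder application path; Airflow builds and runs the spark-submit call.
hook.submit(application="s3://my-bucket/jobs/example_job.py")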

This issue is sporadic in nature, so out of 10 job submissions it might happen 2-3 times. We're using EMR 5.23.0. I've tried upgrading to 5.26.0 as well, but the same issue persists.

If I go to the cluster nodes and check for that 'missing' package, I can see it's already installed. So clearly it's not an issue with the bootstrap script. That leaves me quite confused, because I have no clue whatsoever about what's going on here. I'd guess that it's binding to a different Python version when the job gets triggered from Airflow, but that's just a shot in the dark. Any help is appreciated.
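
One way to test that guess is a small diagnostic job that reports which interpreter and user the Python workers on each executor actually run under. This isn't from the original post; it's just a sketch, and the SparkSession setup and partition count are arbitrary.

import getpass
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-env-check").getOrCreate()

def probe(_):
    """Report the interpreter, user, and sklearn availability on this executor."""
    try:
        import sklearn  # noqa: F401
        has_sklearn = True
    except ImportError:
        has_sklearn = False
    return [(sys.executable, getpass.getuser(), has_sklearn)]

# One task per partition, so the probe runs on several executors.
results = spark.sparkContext.parallelize(range(8), 8).mapPartitions(probe).collect()
for executable, user, has_sklearn in set(results):
    print(executable, user, "sklearn importable:", has_sklearn)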

A similar case may be useful for reference, though I'm not sure whether it works for EMR. In a Hadoop setup, the Python environment and packages should be installed under the hadoop or spark user.

If you install the Python packages under root or another user's environment, a case like yours may happen.

So, try to install your packages under the same user name, hadoop or spark.

Update:

I used to install Cloudera Data Science Workbench, which provides a similar Spark cloud environment. In that case, the distributed dependencies are also needed.

Here is the hyperlink: https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_pyspark.html#distributing_dependencies

The keys are (a sketch of how these fit together from Airflow follows the list):

    1. Install the dependency packages on all cloud nodes.
    2. Set up the conda virtual environment.
    3. Set up the pyspark or pyspark3 path environment.
    4. Deploy the yarn & spark configuration to the gateway (the spark-submit host, or the Airflow host).
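
As a rough illustration of how those keys might look from the Airflow side, the conda environment can be packed into an archive, shipped with the job, and PYSPARK_PYTHON pointed into it. This is only a sketch under assumptions: the archive name, its S3 location, the connection id, and the application path are hypothetical, and the archive is presumed to have been created beforehand (for example with conda-pack).

from airflow.contrib.hooks.spark_submit_hook import SparkSubmitHook

# Ship a pre-packed conda environment alongside the job and run the Python
# workers from inside it, so every node sees the same interpreter and packages.
hook = SparkSubmitHook(
    conf={
        # "env" is the alias the archive is unpacked under in each YARN container.
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./env/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "./env/bin/python",
    },
    conn_id="spark_default",                                 # placeholder connection id
    archives="s3://my-bucket/envs/environment.tar.gz#env",   # hypothetical packed conda env
    name="job_with_packed_conda_env",                        # placeholder app name
)

hook.submit(application="s3://my-bucket/jobs/example_job.py")  # placeholder job script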

Good luck.

If you find the answer helpful, please vote it up.

One way to resolve your problem could be to change the way you submit your job to the cluster:

  • Package the code of the step to run (with its dependencies) in an S3 bucket (using pipenv and Pipfiles, for example). The package would look like this:
<script_to_execute_package>.zip
|- <script_to_execute_main>.py
|-other step files.py
|- ...
|-scikit-learn
    |-scikit-learn files
    | ...
|-pandas
    |- pandas files
    |- ...
|-other packages
    |-other packages files
    |- ...


  • Instead of using the SparkSubmitHook, use an EmrAddStepsOperator (+ Sensor + CreateJobFlowOperator). Run the step with your packaged Python code. It would be something like this:
step_to_run = [
                {
                    'Name': 'your_step_name',
                    'ActionOnFailure': 'CONTINUE',
                    'HadoopJarStep': {
                        'Jar': 'command-runner.jar',
                        'Args': ["spark-submit", "--master", "yarn", "--deploy-mode", "client", "--py-files", "s3://<script_to_execute_package>.zip", "/tmp/driver.py", "<script_to_execute_main>.py", "", "--arg_to_pass_1", "arg1", "--arg_to_pass_2", "arg2", ...]
                    }
                }
]

from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

some_task = EmrAddStepsOperator(
                task_id='some_task',
                job_flow_id='the_previously_created_job_flow_id',
                aws_conn_id=aws_conn_id,
                steps=step_to_run,
                dag=dag
            )

some_task_check = EmrStepSensor(
                task_id='task_check_extract_check',
                job_flow_id='the_previously_created_job_flow_id',
                step_id="{{ task_instance.xcom_pull('some_task', key='return_value')[0] }}",
                aws_conn_id=aws_conn_id,
                poke_interval=10,
                dag=dag
            )

After a lot of trial and error, the following snippet worked out fine as a bootstrap script. The commented-out part was also previously included in our script, and it caused issues. After removing that portion, everything seems to be working fine.

sudo python3 -m pip install --upgrade pip==19.1.1 >> /tmp/out.log

wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/s/spatialindex-1.8.5-1.el7.x86_64.rpm >> /tmp/out.log
sudo yum -y localinstall spatialindex-1.8.5-1.el7.x86_64.rpm >> /tmp/out.log

sudo python3 -m pip install python-dateutil==2.8.0 pandas==0.22.0 pyarrow==0.13.0 scikit-learn==0.21.0 geopy==1.19.0 Shapely==1.6.4.post2 geohash2==1.1 boto3==1.9.183 rtree==0.8.3 geopandas==0.5.0 >> /tmp/out.log

# python3 -m pip install --user python-dateutil==2.8.0 pandas==0.22.0 pyarrow==0.13.0 geopy==1.19.0 Shapely==1.6.4.post2 geohash2==1.1 boto3==1.9.183
# python3 -m pip install --user scikit-learn==0.21.0

One note here: when a job gets submitted through Airflow, it runs as the root user, whereas this script gets executed as the hadoop user on each EMR node. That's probably why the --user installation doesn't work.

Another solution, if you use LaunchClusterOperator in your DAG file, is to use the "cluster_overrides" property. Then you can just copy the configuration from this Amazon page. So the result would look like this (mentioning "Configurations" twice is done intentionally):

   LaunchClusterOperator(dag=yourdag, param2="something", cluster_overrides={
       "Configurations": [
         {
           "Classification": "spark-env",
           "Configurations": [
             {
              "Classification": "export",
              "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"} 
             }
            ]
         }
       ]
      }
     )
