
Why does spark-submit in YARN cluster mode not find python packages on executors?

I am running a boo.py script on AWS EMR using spark-submit (Spark 2.0).

The script finishes successfully when I run

python boo.py

However, it fails when I run

spark-submit --verbose --deploy-mode cluster --master yarn  boo.py

The log retrieved with yarn logs -applicationId ID_number shows:

Traceback (most recent call last):
  File "boo.py", line 17, in <module>
    import boto3
ImportError: No module named boto3

The Python interpreter and boto3 module I am using:

$ which python
/usr/bin/python
$ pip install boto3
Requirement already satisfied (use --upgrade to upgrade): boto3 in /usr/local/lib/python2.7/site-packages

How do I add this library path so that spark-submit can find the boto3 module?

When you run Spark, part of the code runs on the driver and part runs on the executors.

Did you install boto3 on the driver only, or on the driver plus all executor nodes that might run your code?
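One way to confirm is a quick check on a worker node; this is a minimal sketch, assuming you can SSH into a core node as the hadoop user (the hostname below is a placeholder):

$ ssh hadoop@ip-10-0-0-1 'python -c "import boto3; print(boto3.__version__)"'

If boto3 is missing on that node, this fails with the same ImportError you see in the YARN logs.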

One solution might be to install boto3 on all executor nodes, for example with a bootstrap action (a sketch follows after the link below).

This question covers how to install Python modules on Amazon EMR nodes:

How to bootstrap installation of Python modules on Amazon EMR?
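A minimal sketch of such a bootstrap action, with placeholder script and bucket names; because bootstrap actions run on every node when the cluster starts, boto3 ends up on the driver and on all executors:

#!/bin/bash
# bootstrap.sh (placeholder name) -- runs on each EMR node at cluster startup
sudo pip install boto3

Upload the script to S3 and register it when creating the cluster, e.g.:

$ aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/bootstrap.sh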
