Module not found in AWS EMR slave nodes

I am currently running spark-submit jobs on an AWS EMR cluster. I have started running into Python package issues where a module is not found at import time.

One obvious solution would be to log into each individual node and install my dependencies, but I would like to avoid this if possible. Another option is to write a bootstrap script and create a new cluster.

The last solution that seems to work is to pip install my dependencies, zip them, and pass the archive to the spark-submit job through --py-files. That may become cumbersome as my requirements grow, though.
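For reference, the zip step can be scripted so it is less tedious to repeat. The following is only a minimal sketch: the helper name `zip_dependencies` and the `deps/` directory are made up for illustration, and it assumes the packages were first installed into that directory (for example with `pip install -r requirements.txt -t deps/`) and that the output zip is written outside of it.

```python
import zipfile
from pathlib import Path


def zip_dependencies(src_dir, zip_path):
    """Zip everything under src_dir so package folders sit at the archive root.

    Hypothetical helper: returns the list of archive member names it wrote.
    """
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src)
            # Skip directories and bytecode caches; only real files are archived.
            if path.is_dir() or "__pycache__" in rel.parts:
                continue
            arcname = rel.as_posix()
            zf.write(path, arcname)
            names.append(arcname)
    return names
```

The resulting archive would then be passed as `spark-submit --py-files deps.zip job.py` (file names are placeholders). Note that this approach only works for pure-Python packages, since modules with compiled extensions cannot be imported from a zip file.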

Any other suggestions or easy fixes I may be overlooking?

Bootstrap is the solution. Write a shell script that pip installs all your required packages and supply it through the bootstrap option. It will be executed on all nodes when you create the cluster. Just keep in mind that if the bootstrap takes too long (1 hour or so?), it will fail.
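As a rough illustration of what such a script might contain, here is a small sketch that generates the bootstrap script text from a list of packages. The function name `render_bootstrap_script` is invented for this example, and it assumes `sudo` and `python3` are available on the nodes, which may not match every EMR release.

```python
import shlex


def render_bootstrap_script(packages):
    """Return the text of a bash script that pip-installs the given packages.

    Hypothetical helper; assumes sudo and python3 exist on every node.
    """
    if not packages:
        raise ValueError("at least one package is required")
    # Quote each requirement so version specifiers are passed through safely.
    install_args = " ".join(shlex.quote(p) for p in packages)
    lines = [
        "#!/bin/bash",
        "# Intended to run on every node while the cluster is provisioned.",
        "set -euo pipefail",
        "sudo python3 -m pip install " + install_args,
    ]
    return "\n".join(lines) + "\n"
```

The generated file would be uploaded to S3 and referenced when the cluster is created, for example with `aws emr create-cluster ... --bootstrap-actions Path=s3://my-bucket/bootstrap.sh`, where the bucket and file name are placeholders.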

