I am currently running spark-submit jobs on an AWS EMR cluster. I have started running into Python package issues where a module is not found during imports.
One obvious solution would be to log into each individual node and install my dependencies, but I would like to avoid that if possible. Another option is to write a bootstrap script and create a new cluster.
The last solution that seems to work is to pip install my dependencies locally, zip them, and pass the archive to the spark-submit job via --py-files. However, that may become cumbersome as my requirements grow.
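For reference, the --py-files workflow described above looks roughly like this (the package names and file paths are illustrative):

```shell
# Install dependencies into a local directory (package names are examples)
pip install -t ./deps requests boto3

# Zip them so Spark can ship the archive to the executors
cd deps && zip -r ../deps.zip . && cd ..

# Pass the archive to spark-submit; its contents are added to the
# Python path on the workers
spark-submit --py-files deps.zip my_job.py
```

Note that this approach is only reliable for pure-Python packages; libraries with compiled extensions (numpy, pandas, etc.) generally need to be installed on the nodes themselves.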
Any other suggestions or easy fixes I may be overlooking?
Bootstrap is the solution. Write a shell script that pip installs all your required packages and supply it as a bootstrap action. It will be executed on all nodes when you create the cluster. Just keep in mind that if the bootstrap action takes too long (an hour or so?), it will fail.
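A minimal bootstrap script might look like this (the package list is an example; adjust it to your requirements):

```shell
#!/bin/bash
# EMR bootstrap action: runs on every node when the cluster is created.
# Fail fast if any install step errors out.
set -e

# Use the pip belonging to the Python interpreter Spark uses
# (python3 on recent EMR releases).
sudo pip3 install \
    requests \
    boto3 \
    pandas
```

Upload the script to S3 and reference it when creating the cluster, for example with `aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/install-deps.sh` (bucket name and script path are placeholders here).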