I have several Python jobs that I need to execute with Spark. The Python code doesn't use any Spark-specific distributed libraries; it just uses pandas, scipy, and sklearn to manipulate data.
I submit each job to Spark with the command: spark-submit --master spark://ip:7077 python_code.py
When I submit several such jobs, all of them execute only on the master. The CPU on the master goes to 100%, but the worker nodes are all idle. I would have expected Spark's resource manager to distribute the load across the cluster.
I know that my code doesn't use any of the distributed libraries provided by Spark, but is there a way to distribute complete jobs to different nodes?
Without Spark action APIs (collect/take/first/saveAsTextFile), nothing is executed on the executors. It's not possible to distribute plain Python code just by submitting it to Spark: the driver runs your script as an ordinary Python program, which is why only the master's CPU is busy. To use the executors, the work has to be expressed through Spark's distributed APIs (RDDs or DataFrames) and triggered by an action.
If you want to keep the code as plain Python, you can look at other parallel-processing libraries such as Dask ( https://github.com/dask/dask ), which can schedule ordinary pandas/scipy/sklearn functions across a cluster.
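As a quick sketch of that approach (assuming dask is installed), dask.delayed wraps ordinary Python functions into tasks; here the default local scheduler runs them, and pointing a dask.distributed.Client at a scheduler would run the same task graph across cluster workers. process_chunk is a hypothetical stand-in for your real work.

```python
import dask

@dask.delayed
def process_chunk(x):
    # hypothetical placeholder for the per-chunk pandas/scipy/sklearn work
    return x + 1

# Build a graph of independent tasks, then execute them in parallel.
tasks = [process_chunk(i) for i in range(4)]
results = dask.compute(*tasks)
print(results)  # (1, 2, 3, 4)
```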