
Spark jobs running only on master

I have several Python jobs that I need to execute with Spark. The Python code doesn't use any Spark-specific distributed libraries, though; it just uses pandas, scipy, and sklearn to manipulate data.

I submit the jobs to Spark with the command: spark-submit --master spark://ip:7077 python_code.py

When I submit several such jobs, all of them execute only on the master. The CPU on the master goes to 100%, while the worker nodes sit idle. I would have expected Spark's resource manager to distribute the load across the cluster.

I know that my code doesn't use any of the distributed libraries provided by Spark, but is there a way to distribute complete jobs to different nodes?

Without Spark action APIs (collect/take/first/saveAsTextFile), nothing is executed on the executors. It's not possible to distribute plain Python code just by submitting it to Spark: the driver process runs your script, and only work expressed through RDDs or DataFrames that ends in an action is shipped out to the workers. That is why the master's CPU spikes while the workers stay idle.
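
If the per-file work is independent, one way to get it onto the workers is to wrap the pandas logic in an RDD and trigger it with an action. This is only a minimal sketch, assuming placeholder file paths, a hypothetical process() function, and that the input files are readable from every worker:

    # Sketch: run plain pandas work inside Spark tasks so it lands on the executors.
    # The master URL, paths, and process() logic below are placeholders.
    from pyspark.sql import SparkSession
    import pandas as pd

    spark = SparkSession.builder \
        .master("spark://ip:7077") \
        .appName("distribute-pandas") \
        .getOrCreate()
    sc = spark.sparkContext

    input_files = ["/data/part1.csv", "/data/part2.csv", "/data/part3.csv"]  # hypothetical paths

    def process(path):
        # Plain pandas logic; it executes on whichever worker runs this task.
        df = pd.read_csv(path)
        return path, len(df)

    # parallelize() splits the list into partitions; collect() is the action that
    # actually triggers execution on the executors and returns results to the driver.
    results = sc.parallelize(input_files, numSlices=len(input_files)).map(process).collect()
    print(results)
    spark.stop()

Each task then runs the plain pandas code on whichever worker it is scheduled on, so the load spreads across the cluster instead of staying on the driver.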

You can also look at other parallel processing libraries such as dask ( https://github.com/dask/dask ).
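
For comparison, here is a minimal sketch of the same per-file pattern with Dask, assuming a scheduler is already running at the hypothetical address tcp://scheduler:8786 and that the file paths are placeholders:

    # Sketch: distribute independent pandas jobs across a Dask cluster.
    # The scheduler address and paths are assumptions, not real values.
    import dask
    import pandas as pd
    from dask import delayed
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

    @delayed
    def process(path):
        df = pd.read_csv(path)  # plain pandas, executed on a Dask worker
        return path, df.shape

    tasks = [process(p) for p in ["/data/a.csv", "/data/b.csv"]]  # hypothetical paths
    results = dask.compute(*tasks)  # runs the delayed tasks across the workers
    print(results)

Because Dask schedules each delayed call independently, several such jobs can run on different workers without rewriting the pandas code itself.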
