
Installing requirements of a PySpark job before spark-submit

I want to run a Python application on a Spark cluster, sending it there via spark-submit. The application has several dependencies, such as pandas, numpy, and scikit-learn. What is a clean way to ensure that the dependencies are installed before submitting the job?

As I have used virtualenv for development, a requirements.txt can easily be generated.
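
For the packaging step itself, one common approach (not part of the original question; the deps/ directory and deps.zip name below are made up for illustration) is to install the requirements into a local directory and zip it so the archive can be shipped with --py-files:

# inside the activated virtualenv
pip freeze > requirements.txt
# install the requirements into a throwaway directory and zip it up
pip install -r requirements.txt -t deps/
cd deps/ && zip -r ../deps.zip . && cd ..

Note that packages with compiled extensions (numpy, pandas, scikit-learn) usually cannot be imported from a zip, so this works best for pure-Python dependencies; for native packages it is more reliable to install them on the cluster nodes or to ship a packed environment (e.g. conda-pack or venv-pack via --archives).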

You have to run the job in cluster mode. The following assumes that you are using YARN as the resource manager.

spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py
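
On Spark 2.x and later, the yarn-cluster master value is deprecated; the equivalent submission (same placeholder file names) would be:

spark-submit --master yarn --deploy-mode cluster --py-files my_dependency.zip my_script.py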

Also try the following imports in your script to confirm that scikit-learn is importable (note that in scikit-learn 0.18 and later, grid_search was replaced by sklearn.model_selection):

from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

Regarding pandas: if you already have your data as a Spark DataFrame, you can call toPandas() on it to get a pandas DataFrame on the driver.
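
A minimal sketch of that conversion (the example DataFrame here is made up; pandas must be installed on the driver, and toPandas() pulls all rows into driver memory):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toPandas-example").getOrCreate()

# toPandas() collects the whole DataFrame to the driver
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
pdf = sdf.toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>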

numpy is already used internally by many PySpark (MLlib) operations, but I am not sure whether it is guaranteed to be present on every worker node.
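
If in doubt, a quick way to check which of these libraries the executors can actually import is to run the imports inside a task and collect the versions; this is only a diagnostic sketch and assumes an active SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dependency-check").getOrCreate()
sc = spark.sparkContext

def report_versions(_):
    # these imports run on an executor, so they show what the workers see
    import numpy, pandas, sklearn
    return numpy.__version__, pandas.__version__, sklearn.__version__

print(sc.parallelize([0], numSlices=1).map(report_versions).collect())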
