I want to run a Python application on a Spark cluster, sending it there via spark-submit. The application has several dependencies, such as pandas, numpy, and scikit-learn. What is a clean way to ensure that the dependencies are installed before submitting the job? Since I have used virtualenv for development, a requirements.txt can easily be generated.
Run the job in cluster mode (this assumes you are using YARN as the scheduler). Note that --py-files must come before the application script; anything listed after my_script.py is passed to the script as an argument:

spark-submit --master yarn --deploy-mode cluster --py-files my_dependency.zip my_script.py
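A zip shipped this way only works for pure-Python packages; pandas, numpy, and scikit-learn contain compiled extensions and generally have to be installed on the worker nodes themselves (or shipped as a packed virtualenv with --archives). As a minimal sketch of the same idea done programmatically, assuming a hypothetical deps.zip built from your requirements.txt, dependencies can also be attached at runtime with SparkContext.addPyFile:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()

# Distribute a zip of pure-Python dependencies to every executor.
# "deps.zip" is a hypothetical archive; packages with C extensions
# (numpy, pandas, scikit-learn) will not import from a zip and must
# be installed on the workers instead.
spark.sparkContext.addPyFile("deps.zip")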
Also try the following to confirm scikit-learn is importable:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

(In older scikit-learn versions GridSearchCV lived in sklearn.grid_search; that module has since been removed.)
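Continuing from those imports, a minimal sketch of a toy grid search (the hyperparameter values are arbitrary, just to exercise the library end to end):

# Tiny end-to-end check that scikit-learn works where the job runs.
iris = datasets.load_iris()
param_grid = {"n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(iris.data, iris.target)
print(search.best_params_)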
Regarding pandas: if your data is already in a Spark DataFrame, you can call toPandas() to convert it to a pandas DataFrame on the driver. Be aware that this collects the entire dataset into driver memory.
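A minimal sketch (the toy DataFrame is made up for illustration; pandas must be installed on the driver for this to work):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toPandas() pulls every row to the driver, so it is only suitable
# for data that fits on one machine.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
pdf = df.toPandas()
print(pdf.dtypes)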
NumPy is used internally by a number of PySpark components (MLlib in particular), but as far as I know it still has to be installed on every worker node, since it contains compiled code and cannot be shipped inside a --py-files zip.
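A quick way to check that NumPy is actually available on the executors is to import it inside a task; a minimal sketch, with the session setup assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def partition_mean(rows):
    # This import executes on the executor, so it fails fast on any
    # worker node where numpy is missing.
    import numpy as np
    vals = list(rows)
    yield float(np.mean(vals)) if vals else 0.0

rdd = spark.sparkContext.parallelize(range(100), 4)
print(rdd.mapPartitions(partition_mean).collect())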