I want to run a Python application on a Spark cluster, sending it there via spark-submit. The application has several dependencies, such as pandas, numpy, and scikit-learn. What is a clean way to ensure that the dependencies are installed before submitting the job? Since I have used virtualenv for development, a requirements.txt can easily be generated.
Run the job in cluster mode (this assumes you are using YARN as the scheduler). Note that --py-files must come before the application script; anything listed after my_script.py is passed to the script as an argument:

spark-submit --master yarn --deploy-mode cluster --py-files my_dependency.zip my_script.py
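A zip shipped this way only works for pure-Python packages; pandas, numpy, and scikit-learn contain compiled extensions and generally have to be installed on the worker nodes themselves (or shipped as a packed virtualenv with --archives). As a minimal sketch of the same idea done programmatically, assuming a hypothetical deps.zip built from your requirements.txt, dependencies can also be attached at runtime with SparkContext.addPyFile:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()

# Distribute a zip of pure-Python dependencies to every executor.
# "deps.zip" is a hypothetical archive; packages with C extensions
# (numpy, pandas, scikit-learn) will not import from a zip and must
# be installed on the workers instead.
spark.sparkContext.addPyFile("deps.zip")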
Also try the following to confirm scikit-learn is importable:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

(In older scikit-learn versions GridSearchCV lived in sklearn.grid_search; that module has since been removed.)
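Continuing from those imports, a minimal sketch of a toy grid search (the hyperparameter values are arbitrary, just to exercise the library end to end):

# Tiny end-to-end check that scikit-learn works where the job runs.
iris = datasets.load_iris()
param_grid = {"n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(iris.data, iris.target)
print(search.best_params_)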
Regarding pandas: if your data is already in a Spark DataFrame, you can call toPandas() to convert it to a pandas DataFrame on the driver. Be aware that this collects the entire dataset into driver memory.
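A minimal sketch (the toy DataFrame is made up for illustration; pandas must be installed on the driver for this to work):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toPandas() pulls every row to the driver, so it is only suitable
# for data that fits on one machine.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
pdf = df.toPandas()
print(pdf.dtypes)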
NumPy is used internally by a number of PySpark components (MLlib in particular), but as far as I know it still has to be installed on every worker node, since it contains compiled code and cannot be shipped inside a --py-files zip.
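A quick way to check that NumPy is actually available on the executors is to import it inside a task; a minimal sketch, with the session setup assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def partition_mean(rows):
    # This import executes on the executor, so it fails fast on any
    # worker node where numpy is missing.
    import numpy as np
    vals = list(rows)
    yield float(np.mean(vals)) if vals else 0.0

rdd = spark.sparkContext.parallelize(range(100), 4)
print(rdd.mapPartitions(partition_mean).collect())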