
How to spark-submit .py file stored in GCP bucket?

I am trying to run a .py file ( https://github.com/LiuShifeng/Matrix_Factor_Python/blob/master/dsgd_mf.py ). I have copied dsgd_mf.py to my GCP bucket, and the input data file it requires is also in my bucket. How do I spark-submit this script and get the output?

I have a Jupyter notebook running on GCP and the gcloud SDK installed. Other than creating a cluster and running the Jupyter notebook, I have not changed anything else yet. I saw some options that involve a .jar file, but I don't have any .jar file to specify or link. I am new to this, so quick help would be highly appreciated. Please see the linked script; I need help running it on Google Cloud Platform.

Are you running this on Dataproc? If so, you should just be able to submit the pyspark job with something like this:

gcloud --project={YOUR_CLUSTERS_PROJECT} dataproc jobs submit pyspark \
{GCS_PATH_TO_JOB} \
--cluster {CLUSTER_NAME} \
-- {SPACE_DELIMITED_JOB_ARGUMENTS}
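
As a concrete sketch, assuming the script lives at gs://my-bucket/dsgd_mf.py, the cluster is named my-cluster in project my-project, and the job arguments are GCS paths for the input and output files (all of these names are placeholders, and the exact argument list depends on what dsgd_mf.py expects; recent gcloud releases may also require --region):

# Hypothetical example: my-project, my-cluster, my-bucket and the
# trailing arguments are placeholders; check dsgd_mf.py for the
# arguments it actually expects.
gcloud --project=my-project dataproc jobs submit pyspark \
    gs://my-bucket/dsgd_mf.py \
    --cluster my-cluster \
    --region us-central1 \
    -- gs://my-bucket/input.csv gs://my-bucket/output/w.csv gs://my-bucket/output/h.csv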

For what it's worth, though, a running PySpark Jupyter kernel will hold the cluster's resources and block the submitted job from starting (i.e. the logs will repeatedly say that the job is waiting for resources), so stop that kernel before submitting.
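
Once the job finishes, the output should land wherever the script's output arguments point. A minimal sketch for checking and downloading it, assuming the placeholder paths above:

# List and download the job's output files from GCS (paths are placeholders).
gsutil ls gs://my-bucket/output/
gsutil cp gs://my-bucket/output/*.csv .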
