
How to spark-submit a .py file stored in a GCP bucket?

I am trying to run this .py file. I have copied the dsgd_mf.py file to a GCP bucket. The required input data file is also in my bucket. How do I spark-submit this and get the output? ( https://github.com/LiuShifeng/Matrix_Factor_Python/blob/master/dsgd_mf.py )

I have a Jupyter notebook running on GCP and the gcloud SDK installed. Other than creating a cluster and running the Jupyter notebook, I have not changed anything else. I saw some options involving a .jar file, but I don't know of, and don't have, any .jar file to specify or link. I am new to this, and quick help would be highly appreciated. Kindly visit the link to see the script file. I need help to run this on Google Cloud Platform.

Are you running this on Dataproc? If so, you should just be able to submit the pyspark job with something like this:

gcloud --project={YOUR_CLUSTERS_PROJECT} dataproc jobs submit pyspark \
{GCS_PATH_TO_JOB} \
--cluster {CLUSTER_NAME} \
-- {SPACE_DELIMITED_JOB_ARGUMENTS}
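
For example, a filled-in version might look like the sketch below. The project, bucket, cluster, and region names are hypothetical, and the arguments after `--` are just placeholders for whatever dsgd_mf.py expects; depending on your gcloud SDK version you may also need to pass --region explicitly.

# Hypothetical names; substitute your own project, cluster, region, bucket and job arguments.
gcloud --project=my-project dataproc jobs submit pyspark \
gs://my-bucket/dsgd_mf.py \
--cluster my-cluster \
--region us-central1 \
-- gs://my-bucket/input_data.csv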

For what it's worth though, using the pyspark Jupyter kernel will block the job from starting (i.e. the logs will say over and over that the job is waiting for resources).
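
If you want to check whether a submitted job is actually running or still stuck waiting for resources, a minimal sketch (reusing the same hypothetical project, cluster, and region names as above) could be:

# List jobs on the cluster together with their state (PENDING, RUNNING, DONE, ...).
gcloud --project=my-project dataproc jobs list --cluster my-cluster --region us-central1

# Stream the driver output of a specific job (the ID printed when you submit) until it finishes.
gcloud --project=my-project dataproc jobs wait JOB_ID --region us-central1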
