
How to load a file from Google Cloud into a job

I stored a file named pickleRdd on Drive in "/content/drive/My Drive/BD-CW2", the same location as the job read_rdd.py.

But when I run the job on the cluster I get:

Traceback (most recent call last):
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 55, in read_RDD(sys.argv[1:])
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 32, in read_RDD

The code that reads the file inside the job:

RDDFromPickle  = open('pickleRdd', 'rb')

RDDFromPickle = pickle.load(RDDFromPickle)

How can I redirect the code above to read from Drive (/content/drive/My Drive/BD-CW2)? Or move the file from Drive to the cluster so the job can access it? Everything works fine when I run on Colab; the file just cannot be accessed when I run on the cluster.

The easiest way seems to be to adjust

 RDDFromPickle  = open('/content/drive/My Drive/BD-CW2/pickleRdd', 'rb')

but how can I pass the Google Drive location?

Since you are using Google Cloud Platform, I guess you are deploying your PySpark file to Cloud Dataproc. If so, I suggest uploading your file to a bucket in Google Cloud Storage and reading it from there with code like the following (assuming it's a CSV file):

from pyspark.sql import SparkSession

spark = SparkSession \
   .builder \
   .appName('dataproc-python-demo') \
   .getOrCreate()

df = spark.read.format("csv").option("header", 
     "false").load("gs://<bucket>/file.csv")

count_value = df.rdd.map(lambda line: (line._c0, line._c1)).count()

print(count_value)

The code above creates a DataFrame and converts it to an RDD to format the values, but you can also do the same thing with the DataFrame API directly, as shown below.
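For example, a minimal DataFrame-only version of the same count, reusing the df defined above (a sketch, assuming the same headerless two-column CSV):

# Count the rows directly with the DataFrame API, without converting to an RDD first.
count_value = df.select("_c0", "_c1").count()
print(count_value)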

Note that _c0 and _c1 are the default column names Spark assigns when the CSV file has no header. Once you have code like this, you can submit it to your Dataproc cluster this way:

gcloud dataproc jobs submit pyspark --cluster <cluster_name> --region <region, for example us-central1> gs://<bucket>/yourpyfile.py

To submit a new job in Dataproc, you can refer to this link [1].

[1] https://cloud.google.com/dataproc/docs/guides/submit-job#submitting_a_job
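If you would rather keep the original pickle instead of converting it to a CSV, one option (a sketch, not tested against your setup) is to copy the file from Colab into the same bucket, for example with gsutil cp "/content/drive/My Drive/BD-CW2/pickleRdd" gs://<bucket>/BD-CW2/pickleRdd, and then download it inside the job before unpickling. This assumes the google-cloud-storage client library is available on the cluster (it can be installed with pip or through an initialization action); the bucket and object names below are placeholders:

import pickle
from google.cloud import storage

# Placeholder names -- replace with your own bucket and object path.
bucket_name = '<bucket>'
object_name = 'BD-CW2/pickleRdd'
local_path = '/tmp/pickleRdd'

# Download the pickled object from Cloud Storage to the driver's local disk.
client = storage.Client()
client.bucket(bucket_name).blob(object_name).download_to_filename(local_path)

with open(local_path, 'rb') as f:
    RDDFromPickle = pickle.load(f)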

Use the os module with abspath as follows:

import os.path
import pickle

RDDFromPickle = open(os.path.abspath('/content/drive/My Drive/BD-CW2/pickleRdd'), 'rb')
RDDFromPickle = pickle.load(RDDFromPickle)
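Note that this path only exists where Google Drive is actually mounted, which is why it works in Colab but not on the Dataproc cluster. A minimal sketch of the Colab side, assuming the standard /content/drive mount point:

from google.colab import drive
import os.path
import pickle

# Mount Google Drive into the Colab filesystem (prompts for authorization).
drive.mount('/content/drive')

# Open the pickled RDD from the mounted Drive folder.
with open(os.path.abspath('/content/drive/My Drive/BD-CW2/pickleRdd'), 'rb') as f:
    RDDFromPickle = pickle.load(f)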
