
Reading File from Vertex AI and Google Cloud Storage

I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is written using Kubeflow Pipelines and has many different components; one thing in particular is giving me trouble, however. Eventually I want to launch it from a Cloud Function with the help of Cloud Scheduler.

The part that is giving me issues is fairly simple, and I believe I just need some kind of introduction to how I should be thinking about this setup. I simply want to read from and write to files (might be .csv, .txt or similar). I imagine that the analog of the filesystem on my local machine in GCP is Cloud Storage, so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.

@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO

    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)

    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')

    blob = StringIO(blob)  # wrap the decoded string in a file-like object

    names = csv.reader(blob)  # then use the csv module to read the content
    for name in names:
        print(f"First Name: {name[0]}")

The error I'm getting looks like the following:

google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found

What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files, so I must be missing something fundamental. Any help is highly appreciated.

Try specifying the bucket name without the gs:// prefix. This should fix the issue. One more Stack Overflow post that says the same thing: Cloud Storage python client fails to retrieve bucket
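For reference, a minimal corrected version of the component might look like the sketch below. The kfp.v2.dsl import is an assumption based on the question's use of @component for Vertex AI; the bucket and object names are the question's placeholders, and packages_to_install is switched to google-cloud-storage, which is the package that actually provides google.cloud.storage.

# Sketch: same component as in the question, but with the bare bucket name (no gs:// prefix).
from kfp.v2.dsl import component  # assumed import for the @component decorator


@component(
    packages_to_install=["google-cloud-storage"],  # provides google.cloud.storage
    base_image="python:3.9",
)
def main():
    import csv
    from io import StringIO

    from google.cloud import storage

    BUCKET_NAME = "my_bucket"  # bare bucket name, no gs:// prefix

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)

    blob = bucket.blob("test/test.txt")
    content = blob.download_as_string().decode("utf-8")

    names = csv.reader(StringIO(content))
    for name in names:
        print(f"First Name: {name[0]}")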

Any storage bucket you try to access in GCP has a unique address, and that address always starts with gs://, which marks it as a Cloud Storage URI. The GCS client APIs, however, are designed to work with the bucket name alone, so you pass just the name. You only need the complete gs:// address when you reference the bucket by its full URI, for example with gsutil or another tool that expects a Cloud Storage URI.
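To make the distinction concrete, here is a small sketch (the my_bucket name and object path are placeholders from the question; Blob.from_string and download_as_text come from reasonably recent versions of the google-cloud-storage client):

from google.cloud import storage

client = storage.Client()

# The Python client APIs take the bare bucket name...
bucket = client.get_bucket("my_bucket")

# ...not the gs:// URI; passing the URI is what causes the 404 in the question:
# client.get_bucket("gs://my_bucket")

# If you already have a full gs:// URI, the client can parse it for you:
blob = storage.Blob.from_string("gs://my_bucket/test/test.txt", client=client)
print(blob.download_as_text())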
