
Copy from Google Cloud Storage Bucket to S3 Bucket

I have set up an Airflow workflow that ingests some files from S3 into Google Cloud Storage and then runs a series of SQL queries to create new tables in BigQuery. At the end of the workflow I need to push the output of the final BigQuery table to Google Cloud Storage and from there to S3.

I have cracked the transfer of the BigQuery table to Google Cloud Storage with no issues using the BigQueryToCloudStorageOperator Python operator. However, the transfer from Google Cloud Storage to S3 seems to be a less trodden route, and I have been unable to find a solution that I can automate in my Airflow workflow.
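
For reference, that export step reads roughly like this (a minimal sketch with placeholder project, dataset, table and bucket names, assuming the Airflow 1.x contrib import path):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

dag = DAG('bq_to_gcs_example', start_date=datetime(2020, 1, 1),
          schedule_interval=None)

# Export the final BigQuery table to a CSV file in Cloud Storage
# (placeholder names throughout).
bq_to_gcs = BigQueryToCloudStorageOperator(
    task_id='export_final_table',
    source_project_dataset_table='my_project.my_dataset.final_table',
    destination_cloud_storage_uris=['gs://my_gcs_bucket/final_table.csv'],
    export_format='CSV',
    dag=dag)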

I am aware of rsync, which comes as part of gsutil, and have gotten this working (see the post Exporting data from Google Cloud Storage to Amazon S3), but I am unable to add this into my workflow.

I have a dockerised Airflow container running on a Compute Engine instance.

Would really appreciate help solving this problem.

Many thanks!

So we are also using rsync to move data between S3 and GCS.

You first need to get a bash script working, something like gsutil -m rsync -d -r gs://bucket/key s3://bucket/key

For S3 you also need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables.

Then define your BashOperator and put it in your DAG file:

rsync_yesterday = BashOperator(task_id='rsync_task_' + table,
                                bash_command='Your rsync script',
                                dag=dag)
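
Putting the two pieces together, a self-contained version could look roughly like this (a sketch assuming Airflow 1.x import paths; bucket names, keys and the DAG id are placeholders, and in practice you would pull the AWS keys from a secrets backend or Airflow connection rather than hard-coding them):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('gcs_to_s3_rsync', start_date=datetime(2020, 1, 1),
          schedule_interval='@daily')

# gsutil (via boto) reads the AWS credentials from these environment variables.
# BashOperator's env argument replaces the whole environment, so merge with
# os.environ to keep PATH etc. intact.
rsync_gcs_to_s3 = BashOperator(
    task_id='rsync_gcs_to_s3',
    bash_command='gsutil -m rsync -d -r gs://my_gcs_bucket/key s3://my_s3_bucket/key',
    env={**os.environ,
         'AWS_ACCESS_KEY_ID': '<your-access-key-id>',
         'AWS_SECRET_ACCESS_KEY': '<your-secret-access-key>'},
    dag=dag)

Note that the -d flag deletes objects at the destination that are not present at the source, so drop it if that is not what you want.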

Google recommends using its transfer service for transfers between cloud platforms. You can programmatically set up a transfer using their Python API. This way the data is transferred directly between S3 and Google Cloud Storage. The disadvantage of using gsutil and rsync is that the data has to go through the machine/instance that executes the rsync command, which can be a bottleneck.

Google Cloud Storage Transfer Service Doc
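
For reference, creating a transfer job programmatically goes through the storagetransfer v1 API. Below is a minimal sketch using the google-api-python-client library (project ID, bucket names and AWS keys are placeholders; note that the documented direction for this service is from S3 into Cloud Storage, not the reverse):

import googleapiclient.discovery

# Uses Application Default Credentials for the Google side.
storagetransfer = googleapiclient.discovery.build('storagetransfer', 'v1')

transfer_job = {
    'description': 's3-to-gcs-example',
    'status': 'ENABLED',
    'projectId': 'my-gcp-project',                       # placeholder
    'schedule': {
        # Same start and end date means the job runs once.
        'scheduleStartDate': {'year': 2020, 'month': 1, 'day': 1},
        'scheduleEndDate': {'year': 2020, 'month': 1, 'day': 1},
    },
    'transferSpec': {
        'awsS3DataSource': {
            'bucketName': 'my-s3-bucket',                # placeholder
            'awsAccessKey': {
                'accessKeyId': '<aws-access-key-id>',
                'secretAccessKey': '<aws-secret-access-key>',
            },
        },
        'gcsDataSink': {'bucketName': 'my-gcs-bucket'},  # placeholder
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print('Created transfer job: {}'.format(result['name']))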

I had a requirement to copy objects from a GC storage bucket to S3 using AWS Lambda.

The Python boto3 library allows listing and downloading objects from a GC bucket.

Below is sample Lambda code to copy the "sample-data-s3.csv" object from the GC bucket to an S3 bucket.

import io

import boto3

s3 = boto3.resource('s3')

# HMAC interoperability credentials for Google Cloud Storage (placeholder values).
google_access_key_id = "GOOG1EIxxMYKEYxxMQ"
google_access_key_secret = "QifDxxMYSECRETKEYxxVU1oad1b"

gc_bucket_name = "my_gc_bucket"


def get_gcs_objects(google_access_key_id, google_access_key_secret,
                    gc_bucket_name):
    """Lists objects in a GCS bucket and copies one of them to S3 using boto3."""
    # Point an S3 client at the GCS interoperability (XML API) endpoint.
    client = boto3.client("s3", region_name="auto",
                          endpoint_url="https://storage.googleapis.com",
                          aws_access_key_id=google_access_key_id,
                          aws_secret_access_key=google_access_key_secret)

    # Call GCS to list objects in gc_bucket_name
    response = client.list_objects(Bucket=gc_bucket_name)

    # Print object names
    print("Objects:")
    for blob in response["Contents"]:
        print(blob)

    # Download the object from GCS into memory, then upload it to S3.
    target = s3.Object('my_aws_s3_bucket', 'sample-data-s3.csv')
    f = io.BytesIO()
    client.download_fileobj(gc_bucket_name, "sample-data-s3.csv", f)
    target.put(Body=f.getvalue())


def lambda_handler(event, context):
    get_gcs_objects(google_access_key_id, google_access_key_secret,
                    gc_bucket_name)

You can loop through the blobs to download all objects from the GC bucket.
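
For instance, the end of get_gcs_objects could be replaced with a loop along these lines (a sketch reusing client, s3 and gc_bucket_name from the snippet above; the destination bucket name is still a placeholder):

    # Copy every object returned by the listing into the S3 bucket.
    for blob in response["Contents"]:
        key = blob["Key"]
        buf = io.BytesIO()
        client.download_fileobj(gc_bucket_name, key, buf)
        s3.Object('my_aws_s3_bucket', key).put(Body=buf.getvalue())

Keep in mind that list_objects returns at most 1000 keys per call, so larger buckets need pagination (for example via client.get_paginator('list_objects')).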

Hope this helps someone who wants to use AWS Lambda to transfer objects from a GC bucket to an S3 bucket.

The easiest overall option is gsutil rsync; however, there are scenarios where rsync might take too many resources or won't be fast enough.

A couple of other alternatives:

