
Copy files between two GCS buckets partitioned by date

I have a requirement to copy files between two buckets, detailed below:

Bucket A/Folder A is the source inbound box for daily files, which are created as f1_abc_20210304_000. I want to scan Folder A for the latest files (10 files every day), pick up the latest file and the next ones, and copy them into Bucket B/Folder B/FILE name (i.e. from the 10 files)/2021/03/04, dropping the files into the 04 folder.

Any suggestions on how I should proceed with the design?

Thanks, RG

One approach is to use client libraries; in the example below I'm using the Python client library for Google Cloud Storage.

move.py

from google.cloud import storage
from google.oauth2 import service_account
import os 

# as mentioned on https://cloud.google.com/docs/authentication/production
key_path = "credentials.json"
credentials = service_account.Credentials.from_service_account_file(key_path)

storage_client = storage.Client(credentials=credentials)

bucket_name = "source-bucket-id" 
destination_bucket_name = "destination-bucket-id"

source_bucket = storage_client.bucket(bucket_name)
# prefix 'original_data' is the folder where I store the data
array_blobs = source_bucket.list_blobs(prefix='original_data')

filtered_dict = []

for blob in array_blobs:
    if str(blob.name).endswith('.csv'):
        # add additional logic to handle the files you want to ingest
        filtered_dict.append({'name':blob.name,'time':blob.time_created})

orderedlist = sorted(filtered_dict, key=lambda d: d['time'], reverse=True) 
latestblob = orderedlist[0]['name']

# prefix 'destination_data' is the folder where I want to move the data
destination_blob_name = "destination_data/{}".format(os.path.basename(latestblob))


source_blob = source_bucket.blob(latestblob)
destination_bucket = storage_client.bucket(destination_bucket_name)

blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)

print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )

For a bit of context on the code: I use the Google Cloud Storage Python client to log in, list the files under the source folder original_data inside the bucket source-bucket-id, and collect the relevant ones (you can modify the pick-up logic by adding criteria that fit your situation). After that I pick the latest file based on creation time and use its name to copy it into destination-bucket-id. As a note, the destination_blob_name variable includes both the folder where I want to place the file and the final filename. A sketch of how a date-partitioned destination path (Folder B/FILE name/2021/03/04) could be derived from the source file name is shown below.
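If you also need the Bucket B/Folder B/FILE name/2021/03/04 layout from the question, here is a minimal sketch of how that destination path could be built from the source file name. It assumes the files always follow the <name>_<YYYYMMDD>_<sequence> pattern (e.g. f1_abc_20210304_000); build_destination_name and folder_b are hypothetical names, so adjust them to your own naming.

import os

# Hypothetical helper (a sketch, not part of the original answer): derive a
# date-partitioned destination path such as
# "folder_b/f1_abc/2021/03/04/f1_abc_20210304_000" from a source blob name
# like "original_data/f1_abc_20210304_000".
def build_destination_name(source_blob_name, destination_folder="folder_b"):
    filename = os.path.basename(source_blob_name)
    parts = filename.split("_")
    file_key = "_".join(parts[:-2])    # e.g. "f1_abc"
    date_token = parts[-2]             # e.g. "20210304"
    year, month, day = date_token[:4], date_token[4:6], date_token[6:8]
    return "/".join([destination_folder, file_key, year, month, day, filename])

You could then pass build_destination_name(latestblob) as destination_blob_name instead of the flat destination_data/ prefix used above.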

UPDATE: I missed the airflow tag. In that case you should use the operator that comes with the Google provider, GCSToGCSOperator. The parameters to pass can be obtained using a Python task and passed to your operator. It will work like this:

from airflow.decorators import task
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

google_cloud_conn_id = "google_cloud_default"  # your GCP connection id

# multiple_outputs=True pushes each key of the returned dict as its own XCom,
# so the values can be referenced individually from the task's result below
@task(task_id="get_gcs_params", multiple_outputs=True)
def get_gcs_params(**kwargs):
    date = kwargs["next_ds"]
    # logic should be as displayed on move.py
    # ...
    return {"source_objects": source, "destination_object": destination}

gcs_params = get_gcs_params()

copy_file = GCSToGCSOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_objects=gcs_params['source_objects'],
    destination_bucket='data_backup',
    destination_object=gcs_params['destination_object'],
    gcp_conn_id=google_cloud_conn_id,
)
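To illustrate what could go inside get_gcs_params, here is a minimal sketch that reuses the selection logic from move.py and derives a date-partitioned destination from the execution date. The bucket name data and the folder names original_data and folder_b are placeholders for your own.

import os
from airflow.decorators import task
from google.cloud import storage

# Only a sketch: pick the newest .csv under 'original_data' in the 'data'
# bucket and build a date-partitioned destination path from the DAG's
# execution date. Adjust bucket/folder names and the file filter to your case.
@task(task_id="get_gcs_params", multiple_outputs=True)
def get_gcs_params(**kwargs):
    date = kwargs["next_ds"]          # e.g. "2021-03-04"
    year, month, day = date.split("-")

    client = storage.Client()
    blobs = client.bucket("data").list_blobs(prefix="original_data")
    csv_blobs = [b for b in blobs if b.name.endswith(".csv")]
    latest = max(csv_blobs, key=lambda b: b.time_created)

    destination = "folder_b/{}/{}/{}/{}".format(
        year, month, day, os.path.basename(latest.name)
    )
    return {"source_objects": [latest.name], "destination_object": destination}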

For additional guidance you can check the cloud storage examples list. I used Copy an object between buckets for guidance.

Did you want to do this copy task using Airflow?

If yes, Airflow provides the GCSToGCSOperator; a minimal sketch is shown below.
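The bucket, folder, and file names below are placeholders, and the example assumes one file per run named after the execution date. Both source_object and destination_object are templated fields, so the execution date can drive the date-partitioned destination path.

from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# A sketch with placeholder bucket/folder/file names: copy the day's file
# from Bucket A/folderA into Bucket B/folderB/<file name>/YYYY/MM/DD/.
copy_daily_file = GCSToGCSOperator(
    task_id="copy_daily_file",
    source_bucket="bucket-a",
    source_object="folderA/f1_abc_{{ ds_nodash }}_000",
    destination_bucket="bucket-b",
    destination_object=(
        "folderB/f1_abc/{{ ds.replace('-', '/') }}/f1_abc_{{ ds_nodash }}_000"
    ),
    gcp_conn_id="google_cloud_default",
)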
