Airflow operator to copy many files (directory, prefix) from Google Cloud Storage bucket to local filesystem

There is an Airflow operator, GCSToLocalFilesystemOperator, that copies ONE file from a GCS bucket to the local filesystem. However, it supports only a single file; it is not possible to copy many files for a given prefix.

There is a reverse operator, LocalFilesystemToGCSOperator, that allows copying many files from the local filesystem to the bucket; you do it simply with a star in the path ("/*").
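For reference, a minimal sketch of that wildcard upload (the bucket name and paths here are placeholders):

from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

# Upload every file matching the glob to the bucket under one prefix.
# With a wildcard src, dst must end with '/' and is treated as a prefix.
upload_many = LocalFilesystemToGCSOperator(
    task_id='upload_many',
    src='/home/airflow/gcs/data/*',  # placeholder local directory
    dst='testfolder/',               # placeholder destination prefix
    bucket='my-bucket',              # placeholder bucket name
)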

Do you know the best way to copy files by prefix from a bucket to the local filesystem in Airflow? Am I missing something, or is it just not implemented for some reason?

The solution I have come up with so far is compressing the files before putting them in the bucket, downloading the archive as a single file with Airflow, and unzipping it locally with a BashOperator. I'm wondering if there is a better way.
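For illustration, that workaround might look roughly like this (the bucket, object, and paths are made-up placeholders):

from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator

# Download the pre-built archive as a single object...
download_archive = GCSToLocalFilesystemOperator(
    task_id='download_archive',
    bucket='my-bucket',               # placeholder bucket name
    object_name='exports/files.zip',  # placeholder archive object
    filename='/tmp/files.zip',
)

# ...then extract it on the local filesystem.
unzip_archive = BashOperator(
    task_id='unzip_archive',
    bash_command='unzip -o /tmp/files.zip -d /tmp/files',
)

download_archive >> unzip_archive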

I was able to successfully copy multiple files from a GCS bucket to the local (mapped) filesystem for a given prefix in Airflow using the approach below.

import datetime

from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.hooks.gcs import GCSHook


YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
BUCKET_NAME = 'qpalzm-bucket'
GCS_FILES = ['luffy.jpg', 'zoro.jpg']
LOCAL_PATH = '/home/airflow/gcs/data'
PREFIX = 'testfolder'

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with models.DAG(
        'multi_copy_gcs_to_local',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def multi_copy(**kwargs):
        hook = GCSHook()

        for gcs_file in GCS_FILES:
            # Local path the object will be copied to
            filename = f'{LOCAL_PATH}/{gcs_file}'

            # If a PREFIX is set, the object lives under it in the bucket
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = gcs_file

            # Download the object via the GCS hook
            hook.download(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
                filename=filename,
            )

    # Run multi_copy as a task; in Airflow 2 the context is passed to the
    # callable automatically, so provide_context is no longer needed.
    multi_copy_op = python.PythonOperator(
        task_id='multi_gcs_to_local',
        python_callable=multi_copy,
    )

Output: (screenshot of the copied files omitted)
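If you'd rather not hardcode GCS_FILES, the same hook can list the objects under the prefix at run time with GCSHook.list(). A minimal sketch of that variant, reusing the constants above (the trailing-slash check is an assumption to skip zero-byte "folder" placeholder objects):

    def multi_copy_by_prefix(**kwargs):
        hook = GCSHook()

        # Fetch every object name under the prefix instead of hardcoding them.
        for object_name in hook.list(bucket_name=BUCKET_NAME, prefix=PREFIX):
            # Skip "folder" placeholder objects that end with a slash.
            if object_name.endswith('/'):
                continue

            filename = f"{LOCAL_PATH}/{object_name.split('/')[-1]}"
            hook.download(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
                filename=filename,
            )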
