There is an Airflow operator, GCSToLocalFilesystemOperator,
that copies ONE file from a GCS bucket to the local filesystem. However, it supports only a single file, so it cannot copy many files for a given prefix.
There is a reverse operator, LocalFilesystemToGCSOperator,
that can copy many files from the local filesystem to the bucket; you do it simply by using a star in the path ("/*").
Do you know the best way to copy files by prefix from a bucket to the local filesystem in Airflow? Am I missing something, or is it just not implemented for some reason?
The solution I have come up with so far is to compress the files before putting them in the bucket, download the archive as one file with Airflow, and unzip it locally with BashOperator.
I'm wondering if there is a better way.
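For reference, the unzip step of this workaround could also run inside a Python task instead of BashOperator. A minimal sketch, assuming the archive is a zip file that has already been downloaded; `extract_archive` and its arguments are hypothetical names used only for illustration:

```python
import zipfile


def extract_archive(archive_path, target_dir):
    """Unpack a downloaded zip archive and return the names of its members."""
    with zipfile.ZipFile(archive_path) as archive:
        # extractall creates target_dir and any nested folders as needed.
        archive.extractall(target_dir)
        return archive.namelist()
```

This keeps the download and the extraction in Python, but it still requires the files to be archived before they are uploaded.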
I was able to copy multiple files from a GCS bucket to the local (mapped) filesystem for a given prefix in Airflow using the approach below.
import datetime

from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.hooks.gcs import GCSHook

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
BUCKET_NAME = 'qpalzm-bucket'
GCS_FILES = ['luffy.jpg', 'zoro.jpg']
LOCAL_PATH = '/home/airflow/gcs/data'
PREFIX = 'testfolder'

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'multi_copy_gcs_to_local',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def multi_copy(**kwargs):
        hook = GCSHook()

        for gcs_file in GCS_FILES:
            # Local path the file will be copied to
            filename = f'{LOCAL_PATH}/{gcs_file}'

            # Prepend PREFIX to the object name when one is set
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = f'{gcs_file}'

            # Download a single object with the GCS hook
            hook.download(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
                filename=filename,
            )

    # Run multi_copy as a task
    multi_copy_op = python.PythonOperator(
        task_id='multi_gcs_to_local',
        # Only needed on Airflow 1.10; deprecated and unnecessary in Airflow 2
        provide_context=True,
        python_callable=multi_copy,
    )

    multi_copy_op
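The DAG above downloads a fixed list of file names. To copy everything under a prefix without naming each file, one option is to list the objects first and then download them one by one. The following is a minimal sketch, not a tested Composer setup: `download_by_prefix` is a hypothetical helper, and it assumes the `hook` argument behaves like `GCSHook`, whose `list(bucket_name, prefix=...)` method returns object names and whose `download(bucket_name=..., object_name=..., filename=...)` method writes one object to a local path. Unlike the DAG above, it keeps the full object path under the local directory so that nested objects do not overwrite each other.

```python
import os


def download_by_prefix(hook, bucket_name, prefix, local_dir):
    """Download every object whose name starts with `prefix` into `local_dir`.

    `hook` is assumed to be a GCSHook (or any object exposing the same
    `list` and `download` methods). Returns the local paths written.
    """
    downloaded = []
    for object_name in hook.list(bucket_name, prefix=prefix):
        # Skip "folder" placeholder objects such as 'testfolder/'.
        if object_name.endswith('/'):
            continue

        # Mirror the object path under local_dir, e.g.
        # 'testfolder/luffy.jpg' -> '<local_dir>/testfolder/luffy.jpg'.
        local_path = os.path.join(local_dir, *object_name.split('/'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)

        hook.download(
            bucket_name=bucket_name,
            object_name=object_name,
            filename=local_path,
        )
        downloaded.append(local_path)
    return downloaded
```

Inside the DAG, the body of `multi_copy` could then call `download_by_prefix(GCSHook(), BUCKET_NAME, PREFIX, LOCAL_PATH)`. Note that a prefix match is a plain string match, so `'testfolder'` would also match objects under `'testfolder2/'`; using `'testfolder/'` as the prefix avoids that.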