
Airflow to copy most recent file from GCS bucket to local

I want to copy the latest file from a GCS bucket to local storage using Airflow (Cloud Composer). I was trying to use gsutil cp to get the latest file and load it into the local Airflow environment, but got an error: CommandException: No URLs matched. If I check the XCom, I get value='Objects'. Any suggestion?

download_file = BashOperator(
    task_id='download_file',
    bash_command="gsutil cp $(gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''') /home/airflow/gcs/dags",
    xcom_push=True
)

Executing the command gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''' lists each object together with a summary row (total size, object count, etc.), sorts the lines by the second column (the date), takes the last row, and prints that row's third column. Here the summary row ends up sorting after the object rows, and its third field is the word 'objects'. That's why you get 'objects' as the value, as in the sample output below:

TOTAL: 6 objects, 28227013 bytes (26.92 MiB)
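
For illustration, a gsutil ls -l listing looks roughly like the following (the object names, sizes, and timestamps here are made up; only the TOTAL line matches the output above). Each object row is size, creation time, URL, which is why awk '{print $3}' normally returns a URL but returns a word from the summary row when that row sorts last:

  4709560  2021-04-28T09:14:05Z  gs://<bucket_name>/data_20210428.csv
  5120033  2021-04-29T09:13:58Z  gs://<bucket_name>/data_20210429.csv
  ...
TOTAL: 6 objects, 28227013 bytes (26.92 MiB)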

Try this code instead, which takes the second-to-last row (the newest object) rather than the summary row:

download_file = BashOperator(
    task_id='download_file',
    bash_command="gsutil cp $(gsutil ls -l gs://bucket_name | sort -k 2 | tail -2 | head -n1 | awk '''{print $3}''') /home/airflow/gcs/dags",
    xcom_push=True
)
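
If you would rather not parse gsutil's text output at all, a more robust option is to pick the newest object with the google-cloud-storage Python client and call it from a PythonOperator. This is only a sketch, assuming the library is available in your Composer environment; the bucket name, destination directory, and function name below are placeholders:

from google.cloud import storage

def download_latest_file(bucket_name: str, dest_dir: str) -> str:
    """Download the most recently updated object from a GCS bucket."""
    client = storage.Client()
    # Skip "directory" placeholder objects whose names end with a slash
    blobs = [b for b in client.list_blobs(bucket_name) if not b.name.endswith('/')]
    # Choose the object with the newest update timestamp
    latest = max(blobs, key=lambda b: b.updated)
    local_path = f"{dest_dir}/{latest.name.split('/')[-1]}"
    latest.download_to_filename(local_path)
    return local_path

Called from a PythonOperator, the returned local path would be pushed to XCom automatically, similar to what xcom_push=True does for the BashOperator above.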
