How to use Airflow AWS connection credentials in a BashOperator to transfer files from an AWS S3 bucket to GCS

As I am working across two clouds, my task is to rsync files arriving in an S3 bucket to a GCS bucket. To achieve this I am using the GCP Composer (Airflow) service, where I schedule this rsync operation to sync the files. I am using an Airflow connection (aws_default) to store the AWS access key and secret access key. Everything works fine, but the credentials are visible in the task logs, which exposes them again, and I don't want them displayed even in logs. Please advise if there is any way to keep the credentials out of the logs.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.base_hook import BaseHook
from airflow.operators.bash_operator import BashOperator

START_TIME = datetime.utcnow() - timedelta(hours=1)

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'start_date': START_TIME,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=3)
}

aws_connection = BaseHook.get_connection('aws_default')

bash_env = {
        "AWS_ACCESS_KEY_ID": aws_connection.login,
        "AWS_SECRET_ACCESS_KEY": aws_connection.password
}

# The keys are formatted directly into the bash command string, so the rendered
# command (and therefore the task log) ends up containing the credentials.
rsync_command = '''
    set -e;
    export AWS_ACCESS_KEY_ID="%s";
    export AWS_SECRET_ACCESS_KEY="%s";
''' % (bash_env.get('AWS_ACCESS_KEY_ID'), bash_env.get('AWS_SECRET_ACCESS_KEY')) \
+ '''
    gsutil -m rsync -r -n s3://aws_bucket/{{ execution_date.strftime('%Y/%m/%d/%H') }}/ gs://gcp_bucket/good/test/
'''

dag = DAG(
    'rsync',
    default_args=default_args,
    description='This dag is for gsutil rsync from s3 bucket to gcs storage',
    schedule_interval=timedelta(minutes=20),
    dagrun_timeout=timedelta(minutes=15)
    )


s3_sync = BashOperator(
    task_id='gsutil_s3_gcp_sync',
    bash_command=rsync_command,
    dag=dag,
    depends_on_past=False,
    execution_timeout=timedelta(hours=1),
    )

I would suggest putting the credentials in a boto config file, separate from Airflow; see the sketch after the list below. More information on the config file is in the gsutil/boto documentation.

It has a [Credentials] section:

[Credentials]
  aws_access_key_id
  aws_secret_access_key
  gs_access_key_id
  gs_host
  gs_host_header
  gs_json_host
  gs_json_host_header
  gs_json_port
  gs_oauth2_refresh_token
  gs_port
  gs_secret_access_key
  gs_service_client_id
  gs_service_key_file
  gs_service_key_file_password
  s3_host
  s3_host_header
  s3_port
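
With the keys stored in the boto config file, nothing sensitive has to be exported inside the bash command, so the rendered command and the task log stay clean. A minimal sketch of the reworked task follows; the config path /home/airflow/gcs/data/boto.cfg is an assumption (place the file wherever your Composer workers can read it), and only aws_access_key_id / aws_secret_access_key need to be filled in for S3 access.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# gsutil reads credentials from the boto config file pointed to by BOTO_CONFIG,
# so the command itself no longer contains any secrets. The path below is an
# assumed location on the Composer workers, not a fixed requirement.
rsync_command = '''
    set -e;
    export BOTO_CONFIG=/home/airflow/gcs/data/boto.cfg;
    gsutil -m rsync -r -n s3://aws_bucket/{{ execution_date.strftime('%Y/%m/%d/%H') }}/ gs://gcp_bucket/good/test/
'''

dag = DAG(
    'rsync',
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(minutes=20),
)

s3_sync = BashOperator(
    task_id='gsutil_s3_gcp_sync',
    bash_command=rsync_command,
    execution_timeout=timedelta(hours=1),
    dag=dag,
)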
