
Airflow BashOperator - Use a different role than its pod role

I've tried to run the following commands as part of a bash script executed by a BashOperator:

aws s3 ls s3://bucket
aws s3 cp ... ...

The script itself runs successfully, however the aws CLI commands return errors, showing that the aws CLI does not run with the needed permissions (i.e. the ones defined in the airflow-worker-node role).
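
For context, the task that triggers the problem looked roughly like this (a sketch; the task id, env values, and script path mirror the working version shown at the end of this post, just without the **os.environ workaround):

from airflow.operators.bash import BashOperator

copy_task = BashOperator(
    task_id="copy_data_from_mcd_s3",
    env={"dag_input": "{{ dag_run.conf }}"},  # custom env vars only, nothing inherited
    bash_command="utils/my_script.sh",  # the script that runs the aws CLI commands above
    dag=dag,  # assumes a `dag` object defined elsewhere in the DAG file
    retries=1,
)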

Investigating the error:

  1. I've upgraded awscli in the Docker image running in the pod to version 2.4.9 (I understood that older versions of awscli do not support S3 access based on permissions granted via an AWS role).

  2. I've investigated the pod running my bash script via the BashOperator:

  • Using k9s and the D (describe) command:

    • I saw that ARN_ROLE is defined correctly.
  • Using k9s and the s (shell) command:

    • I saw that the pod's environment variables are correct.
    • The aws CLI worked with the needed permissions and could access S3 as needed.
    • aws sts get-caller-identity reported the right role (airflow-worker-node).
  3. Running the above commands as part of the bash script executed by the BashOperator gave me different results (see the sketch after this list):

    • Running env showed a limited set of environment variables.
    • The aws CLI returned a permission-related error.
    • aws sts get-caller-identity reported the EKS node role (eks-worker-node).
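
A debug task along these lines reproduces the difference; because it passes the same kind of explicit env dict as the real task, the spawned shell sees only those variables rather than the pod's full environment (the task id is hypothetical):

from airflow.operators.bash import BashOperator

debug_identity = BashOperator(
    task_id="debug_aws_identity",
    env={"dag_input": "{{ dag_run.conf }}"},  # explicit env dict, as in the real task
    bash_command=(
        "env | sort && "                   # shows the limited set of variables
        "aws sts get-caller-identity && "  # reports eks-worker-node here
        "aws s3 ls s3://bucket"            # fails with a permission error
    ),
    dag=dag,  # assumes a `dag` object defined elsewhere
)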

How can I grant the aws CLI in my BashOperator bash script the needed permissions?

Reviewing the BashOperator source code, I've noticed the following code:

https://github.com/apache/airflow/blob/main/airflow/operators/bash.py

def get_env(self, context):
    """Builds the set of environment variables to be exposed for the bash command"""
    system_env = os.environ.copy()
    env = self.env
    if env is None:
        # No env was passed to the operator: inherit the full process environment.
        env = system_env
    else:
        if self.append_env:
            # append_env=True: overlay the user-supplied env on the process environment.
            system_env.update(env)
            env = system_env

And the following documentation:

:param env: If env is not None, it must be a dict that defines the
    environment variables for the new process; these are used instead
    of inheriting the current process environment, which is the default
    behavior. (templated)
:type env: dict
:param append_env: If False(default) uses the environment variables passed in env params
    and does not inherit the current process environment. If True, inherits the environment variables
    from current passes and then environment variable passed by the user will either update the existing
    inherited environment variables or the new variables gets appended to it
:type append_env: bool

If the BashOperator's env parameter is None, it copies the environment variables of the parent process into the child process. In my case I provided some env variables, so the parent's environment was not copied into the child process. The child process (the one running the bash command) therefore lost the variables that bind the pod to the airflow-worker-node role (presumably the IRSA variables such as AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE), and the aws CLI fell back to the default node role, eks-worker-node.
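
A minimal sketch of the same effect outside Airflow, assuming the pod injects IRSA variables such as AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the process environment (the custom variable below is illustrative):

import os
import subprocess

custom_env = {"dag_input": "some-value"}  # roughly what I passed to the BashOperator

# Explicit env dict: the child sees ONLY these variables, the IRSA variables
# are gone, and the aws CLI falls back to the node's instance role.
subprocess.run(["aws", "sts", "get-caller-identity"], env=custom_env)

# Merging os.environ back in (the workaround below) restores the IRSA
# variables, so the CLI assumes the airflow-worker-node role again.
subprocess.run(["aws", "sts", "get-caller-identity"], env={**custom_env, **os.environ})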

The simple solution is to set the flag append_env=True on the BashOperator(), which merges all existing environment variables with the env variables I added manually.
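
On Airflow versions that have this flag (2.2.0 and later, as noted in the code comment below), the task would look roughly like this; the sketch reuses the names from the workaround shown next:

from airflow.operators.bash import BashOperator

copy_task = BashOperator(
    task_id="copy_data_from_mcd_s3",
    env={"dag_input": "{{ dag_run.conf }}"},
    append_env=True,  # inherit the pod environment and overlay the values above
    bash_command="utils/my_script.sh",
    dag=dag,
    retries=1,
)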

However, in the version I'm running (2.0.1) this flag isn't supported (it is supported in later versions). As a temporary solution I've added **os.environ to the BashOperator env parameter:

return BashOperator(
    task_id="copy_data_from_mcd_s3",
    env={
        "dag_input": "{{ dag_run.conf }}",
        ......
        **os.environ,
    },
    # append_env=True - should be supported in Airflow 2.2.0
    bash_command="utils/my_script.sh",
    dag=dag,
    retries=1,
)

This solved the problem.
