
How to directly read an Excel file from S3 with pandas in an Airflow DAG?

I am trying to read an Excel file from S3 inside an Airflow DAG with Python, but it does not seem to work. It is very strange, because it works when I read the file from outside Airflow with pd.read_excel(s3_excel_path).

What I did:

  • Set AWS credentials in Airflow (this works well, as I can list my S3 bucket)
  • Install pandas and s3fs in the Docker environment where I run Airflow
  • Try to read the file with pd.read_excel(s3_excel_path)

As I said, it works when I try it outside of Airflow. Moreover, I don't get any error: the DAG just keeps running indefinitely (at the step where it is supposed to read the file) and nothing happens, even if I wait 20 minutes.
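For context, a minimal sketch of the kind of DAG task described above; the bucket, key, and DAG names are hypothetical placeholders, not the asker's actual code:

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def read_excel_from_s3():
    # The call that hangs inside the DAG but works in a plain Python shell.
    df = pd.read_excel("s3://my-bucket/path/to/file.xlsx")
    print(df.head())

with DAG(
    dag_id="read_s3_excel",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    PythonOperator(task_id="read_excel", python_callable=read_excel_from_s3)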

(I would like to avoid downloading the file from S3, processing it, and then uploading it back to S3, which is why I am trying to read it directly from S3.)

Note: it does not work with CSV either.

EDIT: Likewise, I can't save my DataFrame directly to S3 with df.to_csv('s3_path') in the Airflow DAG, while I can do it in plain Python.

To read data files stored in S3 with pandas, you have two options: download them using boto3 (or the AWS CLI) and read the local files, which is the solution you are not looking for, or use the s3fs API supported by pandas:

import os

import pandas as pd

# Credentials are read from the environment (for example, exported from an
# Airflow connection).
AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

key = "path/to/excel/file"

# pandas delegates the s3:// URL to s3fs; storage_options is passed through
# to the underlying filesystem.
books_df = pd.read_excel(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
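The same storage_options argument also works for writing (assuming pandas >= 1.2), which covers the df.to_csv case from the edit; the target key here is a placeholder:

books_df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}.csv",
    index=False,
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)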

To use this solution, you need to install s3fs and apache-airflow-providers-amazon:

pip install s3fs
pip install apache-airflow-providers-amazon
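
For reference, the first option (download with boto3, then read the local copy) would look roughly like this sketch; the bucket name, key, and local path are placeholders:

import boto3
import pandas as pd

# Credentials are resolved from the environment or the configured AWS profile.
s3 = boto3.client("s3")

# Download the object to a local file, then read it with pandas as usual.
local_path = "/tmp/file.xlsx"
s3.download_file("my-bucket", "path/to/excel/file", local_path)
df = pd.read_excel(local_path)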
