
Use Apache Airflow to edit CSV stored in AWS S3 without download

I have a project that requires large amounts of CSV data to be transformed regularly. This data will be stored in S3, and I am using an EC2 instance running Ubuntu Server 16.04 to edit the data and Apache Airflow to route it. Downloading and re-uploading this data to S3 is quite expensive. Is there a way I can edit this CSV data in memory without downloading the file to local storage on the Ubuntu instance?

Thank you in advance.

In general, you could write a program that fetches the CSV file from S3 (using the S3 SDK), holds and transforms it in memory, and then saves it back to S3. But this still requires "downloading and re-uploading": the bytes travel over the network in both directions. The only difference is that the file is never written to local disk; it is kept in the program's memory.
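The fetch, transform, and save-back steps described above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: it uses the `boto3` SDK (`get_object` / `put_object`), it assumes the whole file fits in memory, and `transform_csv` is a hypothetical placeholder transformation (upper-casing the header row) standing in for whatever edit you actually need. The function names are made up for this example.

```python
import csv
import io


def transform_csv(raw: bytes) -> bytes:
    """Hypothetical edit: upper-case the header row, keep data rows unchanged."""
    reader = csv.reader(io.StringIO(raw.decode("utf-8"), newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        writer.writerow([cell.upper() for cell in row] if i == 0 else row)
    return out.getvalue().encode("utf-8")


def edit_csv_in_s3(bucket: str, key: str) -> None:
    """Read an object into memory, transform it, and overwrite it in S3.

    Nothing is written to local disk, but the data is still transferred
    from S3 and back again.
    """
    import boto3  # imported here so transform_csv works without boto3 installed

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=key, Body=transform_csv(body))
```

A function like `edit_csv_in_s3` could then be called from an Airflow task (for example via a `PythonOperator`), with the bucket and key passed in as arguments.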

You could also use s3fs to mount the S3 bucket to a directory on the server and perform the required operations directly on the files. But they still need to be downloaded from S3 and re-uploaded there (although this happens on the fly and is invisible to you).
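With the bucket mounted, the files can be handled through ordinary file operations. A minimal sketch, assuming the bucket has already been mounted with s3fs at a hypothetical path such as `/mnt/s3-data` (the path, the helper name `rewrite_csv`, and the row transformation are all illustrative assumptions):

```python
import csv


def rewrite_csv(path, transform_row):
    """Read a CSV file, apply transform_row to every row, and write it back.

    On an s3fs mount the read and the write are translated into S3
    transfers behind the scenes.
    """
    with open(path, newline="") as src:
        rows = [transform_row(row) for row in csv.reader(src)]
    with open(path, "w", newline="") as dst:
        csv.writer(dst, lineterminator="\n").writerows(rows)


# Example call (hypothetical mount point and file name):
# rewrite_csv("/mnt/s3-data/input.csv", lambda row: [c.strip() for c in row])
```

This keeps the code free of any S3-specific calls, at the cost of hiding the network transfers that still take place.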

Hope that helps.

Notice: Technical posts on this site (STACKOOM.COM) are licensed under CC BY-SA 4.0. If you repost this content, please cite this site's URL or the original source address.