
How to read a parquet file into a PCollection from s3?

My problem is simple: I want to read a parquet file from s3 into a PCollection in Apache Beam using the Python SDK.

I know of the apache_beam.io.parquetio module, but it does not seem to be able to read from s3 directly (or does it?).

I also know of the apache_beam.io.aws.s3io module, but it seems to return an s3 file object, or at any rate something that is not a PCollection (or does it?).

So what's the best way to do this?

If you install Beam with the aws extra:

pip install 'apache-beam[aws]'

you can just pass an s3 filename to ReadFromParquet to read from it:

import apache_beam as beam
filename = "s3://bucket-name/..."
beam.io.ReadFromParquet(filename)


 