How to read a parquet file into a PCollection from s3?
My problem is simple: I want to read a parquet file from s3 into a PCollection in Apache Beam using the Python SDK.
I know of the apache_beam.io.parquetio module, but it does not seem to be able to read from s3 directly (or does it?).
I know of the apache_beam.io.aws.s3io module, but it seems to return an s3 file object, or at any rate something that is not a PCollection (or does it?).
So what's the best way to do this?
If you install Beam with the aws extra:
pip install 'apache-beam[aws]'
you can just pass an s3 filename to ReadFromParquet and it will read from it:
import apache_beam as beam
filename = "s3://bucket-name/..."
beam.io.ReadFromParquet(filename)  # apply this to a pipeline to get a PCollection