
Apache Beam dataframe write csv to GCS without shard name template

I have a Dataflow pipeline using the Apache Beam dataframe API, and I'd like to write the csv to a GCS bucket. This is my code:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    # Forward-fill then back-fill missing values within each group.
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')

However, when I pass a GCS path to known_args.output, the csv written to GCS has a shard suffix appended, like gs://path/to/file-00000-of-00001. For my project, I need the file name without the shard suffix. I've read the documentation but there seem to be no options to remove the shard. I tried converting the df back to a PCollection and using WriteToText, but it doesn't work either, and it's also not a desirable solution.

It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert to a PCollection and use WriteToText(..., shard_name_template='').
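For example, a minimal sketch of that workaround might look like this; to_pcollection yields the dataframe rows as named tuples, and the comma-join formatting below is a simplifying assumption (it does not quote fields that contain commas or newlines):

import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    # ... dataframe transforms as before ...
    # Convert the deferred dataframe back to a PCollection of named tuples.
    rows = to_pcollection(df)
    (rows
     | 'Format' >> beam.Map(lambda row: ','.join(str(field) for field in row))
     # An empty shard_name_template suppresses the -00000-of-00001 suffix.
     | 'Write' >> beam.io.WriteToText(known_args.output, shard_name_template=''))

Note that this drops the CSV header line; WriteToText's header argument can re-add one if you need it.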

I filed BEAM-22923. When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this, as well as windowing information), e.g.

from apache_beam.io import fileio

# Write a single shard named out.csv under output_dir.
df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))


