
Apache Beam dataframe write csv to GCS without shard name template

I have a Dataflow pipeline using the Apache Beam dataframe API, and I'd like to write the csv to a GCS bucket. This is my code:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    # Forward-fill then back-fill missing values within each group.
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')

However, when I pass a GCS path to known_args.output, the csv written to GCS has a shard suffix appended, like gs://path/to/file-00000-of-00001. For my project, I need the file name without the shard suffix. I've read the documentation but there seem to be no options to remove the shard. I tried converting the df back to a PCollection and using WriteToText, but it doesn't work either, and it's also not a desirable solution.

It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert to a PCollection and use WriteToText(..., shard_name_template='').
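For example, a minimal sketch of that workaround might look like this; to_pcollection yields the dataframe rows as named tuples, and the comma-join formatting below is a simplifying assumption (it does not quote fields that contain commas or newlines):

import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    # ... dataframe transforms as before ...
    # Convert the deferred dataframe back to a PCollection of named tuples.
    rows = to_pcollection(df)
    (rows
     | 'Format' >> beam.Map(lambda row: ','.join(str(field) for field in row))
     # An empty shard_name_template suppresses the -00000-of-00001 suffix.
     | 'Write' >> beam.io.WriteToText(known_args.output, shard_name_template=''))

Note that this drops the CSV header line; WriteToText's header argument can re-add one if you need it.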

I filed BEAM-22923. When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this, as well as windowing information), e.g.

from apache_beam.io import fileio

# Write a single shard named out.csv under output_dir.
df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))


