Apache Beam dataframe: write csv to GCS without shard name template
I have a Dataflow pipeline using the Apache Beam dataframe API, and I'd like to write the csv to a GCS bucket. This is my code:
with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')
However, when I pass a GCS path to known_args.output, the csv written to GCS gets a shard suffix appended, like gs://path/to/file-00000-of-00001. For my project, I need the file name without the shard suffix. I've read the documentation, but there seems to be no option to remove the shard. I tried converting the df back to a PCollection and using WriteToText, but that doesn't work either, and it's not a desirable solution anyway.
It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert to a PCollection and use WriteToText(..., shard_name_template='').
I filed BEAM-22923. When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this as well as of windowing information), e.g.
df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))