
Apache beam dataframe write csv to GCS without shard name template

I have a Dataflow pipeline using the Apache Beam dataframe API, and I'd like to write the CSV to a GCS bucket. This is my code:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')

However, when I pass a GCS path as known_args.output , the CSV written to GCS gets a shard suffix, like gs://path/to/file-00000-of-00001 . For my project, I need the file name to be without the shard suffix. I've read the documentation, but there seems to be no option to remove it. I tried converting the dataframe back to a PCollection and using WriteToText , but it doesn't work either, and it's also not a desirable solution.
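For context, the suffix comes from Beam's default shard name template, which expands a pattern of the form -SSSSS-of-NNNNN with the shard index and total shard count. A minimal illustration (the helper function here is hypothetical, just mimicking the formatting):

```python
# Hypothetical sketch of how Beam's default shard template '-SSSSS-of-NNNNN'
# is expanded: 'SSSSS' becomes the zero-padded shard index and 'NNNNN'
# the zero-padded shard count, so even a single-shard write gets a suffix.
def default_shard_name(prefix: str, shard: int, num_shards: int) -> str:
    return f"{prefix}-{shard:05d}-of-{num_shards:05d}"

print(default_shard_name("gs://path/to/file", 0, 1))
# gs://path/to/file-00000-of-00001
```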

It looks like you're right; as of Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert to a PCollection and use WriteToText(..., shard_name_template='') .

I filed BEAM-22923 . When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this as well as windowing information), e.g.

df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))

