
How To Combine Parsed TextFiles In Apache-Beam DataFlow in Python?

This pipeline works fine with DirectRunner, but errors out when I switch to DataflowRunner. I need to combine the files that are read in, but as soon as I use beam.combiners.ToList() to gather my data into one list, the pipeline fails.

Code Example:

import io

import apache_beam as beam
import pandas as pd
from apache_beam.io import fileio  # makes beam.io.fileio available

# pipeline_options, args, raw_data_path, feature_data_path and ApplyPipeline
# are not shown here.

def convert_to_dataframe(readable_file):
    # Parse one matched file into a DataFrame.
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))

class merge_dataframes(beam.DoFn):
    def process(self, element):
        # element is the list of DataFrames produced by ToList().
        yield pd.concat(element).reset_index(drop=True)

with beam.Pipeline(options=pipeline_options) as p:

    (p
        | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
        | 'Read Files' >> beam.io.fileio.ReadMatches()
        | 'Shuffle' >> beam.Reshuffle()
        | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
        | 'Combine To List' >> beam.combiners.ToList()
        | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
        | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                              project_name=args.project_name,
                                                              feature_group_name=args.feature_group_name
                                                              ))
        | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                                file_name_suffix='.csv',
                                                shard_name_template='')
     )

Error:

"No objects to concatenate [while running 'Merge DataFrames']" 

I don't understand this error because the part that does 'Combine To List' should have produced a list of dataframes that would then get passed into the step 'Merge DataFrames', which is indeed the case when I use DirectRunner.

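Given this error, I suspect that MatchFiles is not actually matching anything (for example, because of a wrong file pattern), and as a result the output of beam.combiners.ToList is an empty list.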
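
If that is the cause, the same message can be reproduced outside Beam with pandas alone: pd.concat raises a ValueError when it is handed an empty list. The snippet below is a minimal sketch of that behaviour plus one possible guard. merge_frames is a hypothetical stand-in for the body of merge_dataframes.process, not part of the original pipeline, and skipping the empty case is only one option (raising a clearer error about the file pattern would be another).

```python
import pandas as pd


def merge_frames(frames):
    """Yield one merged DataFrame, or nothing if the list is empty.

    Hypothetical helper mirroring merge_dataframes.process from the question.
    """
    frames = list(frames)
    if not frames:
        # An empty list here would mean no files were matched upstream.
        return
    yield pd.concat(frames).reset_index(drop=True)


# Reproduce the error message from the question with an empty list.
try:
    pd.concat([])
except ValueError as exc:
    error_message = str(exc)

# With the guard, the empty case produces no output instead of raising.
empty = list(merge_frames([]))

# The non-empty case still concatenates and resets the index.
merged = list(merge_frames([pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]))
```

A guard like this only hides the symptom. If the list is empty on Dataflow, the file pattern in raw_data_path is still worth checking first.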
