How do I merge parsed text files in Apache Beam / Dataflow with Python?

Question

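This seems to work fine with the DirectRunner, but it fails when I switch to the DataflowRunner. I basically need to combine the files that are read in somehow, but as soon as I use beam.combiners.ToList() to collect my data, it introduces a whole series of problems.

Code example: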

import io

import apache_beam as beam
import pandas as pd
from apache_beam.io import fileio  # ensures beam.io.fileio is loaded

def convert_to_dataframe(readable_file):
    # Parse one matched file into a DataFrame.
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))

class merge_dataframes(beam.DoFn):
    def process(self, element):
        # element is the list of DataFrames produced by ToList().
        yield pd.concat(element).reset_index(drop=True)

with beam.Pipeline(options=pipeline_options) as p:

    (p
        | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
        | 'Read Files' >> beam.io.fileio.ReadMatches()
        | 'Shuffle' >> beam.Reshuffle()
        | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
        | 'Combine To List' >> beam.combiners.ToList()
        | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
        | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                              project_name=args.project_name,
                                                              feature_group_name=args.feature_group_name
                                                              ))
        | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                                file_name_suffix='.csv',
                                                shard_name_template='')
     )

Error:

"No objects to concatenate [while running 'Merge DataFrames']"

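I don't understand this error, because the 'Combine To List' step should produce a list of DataFrames that is then passed to the 'Merge DataFrames' step, which is indeed what happens when I use the DirectRunner.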

Answer 1

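Given this error, I suspect that MatchFiles is not actually matching anything (for example, because of a bad file pattern), and as a result the output of beam.combiners.ToList is an empty list.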
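
To illustrate the point, here is a minimal sketch of a merge step that tolerates an empty list. It is not the asker's code: the function name merge_or_skip and the in-memory sample frames are hypothetical, and the sketch only assumes that pd.concat raises "No objects to concatenate" when given an empty list.

```python
import pandas as pd


def merge_or_skip(frames):
    """Yield one merged DataFrame, or nothing when the input list is empty."""
    frames = list(frames)
    if not frames:
        # pd.concat([]) raises ValueError("No objects to concatenate"),
        # so an empty list (e.g. no files matched) is skipped instead.
        return
    yield pd.concat(frames).reset_index(drop=True)


# Hypothetical in-memory frames standing in for the parsed CSV files.
parts = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
merged = list(merge_or_skip(parts))  # one DataFrame with three rows
empty = list(merge_or_skip([]))      # nothing is yielded for an empty list
```

In the pipeline from the question, a function like this could probably stand in for the 'Merge DataFrames' step via beam.FlatMap(merge_or_skip). It only avoids the crash, though; if nothing is matched, the file pattern passed to MatchFiles still needs to be checked.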

Solution 1
1 accepted 2020-02-13 23:40:28