
How To Combine Parsed Text Files In Apache Beam Dataflow in Python?

This seems to work fine in DirectRunner, but it errors out when I switch to DataflowRunner. I basically need to combine the files that are read in, but as soon as I use beam.combiners.ToList() to concatenate my data, it introduces a whole slew of issues.

Code Example:

import io

import apache_beam as beam
import pandas as pd


def convert_to_dataframe(readable_file):
    # Parse each matched file into a DataFrame.
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))


class merge_dataframes(beam.DoFn):
    def process(self, element):
        # element is the list of DataFrames produced by 'Combine To List'.
        yield pd.concat(element).reset_index(drop=True)


with beam.Pipeline(options=pipeline_options) as p:

    (p
        | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
        | 'Read Files' >> beam.io.fileio.ReadMatches()
        | 'Shuffle' >> beam.Reshuffle()
        | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
        | 'Combine To List' >> beam.combiners.ToList()
        | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
        | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                              project_name=args.project_name,
                                                              feature_group_name=args.feature_group_name
                                                              ))
        | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                                file_name_suffix='.csv',
                                                shard_name_template='')
     )

Error:

"No objects to concatenate [while running 'Merge DataFrames']" 

I don't understand this error, because the 'Combine To List' step should have produced a list of DataFrames that then gets passed into the 'Merge DataFrames' step, which is indeed what happens when I use DirectRunner.
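For what it's worth, the error message itself comes from pandas, not Beam: it is exactly what `pd.concat` raises when it is handed an empty iterable. So if 'Combine To List' emits an empty list, 'Merge DataFrames' will fail with this message. A minimal reproduction, independent of Beam:

```python
import pandas as pd

# pd.concat raises ValueError when given no DataFrames to concatenate,
# which is the same error surfaced inside the 'Merge DataFrames' step.
try:
    pd.concat([])
except ValueError as e:
    print(e)  # -> No objects to concatenate
```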

Given this error, I suspect that MatchFiles is not actually matching anything (for example, because the file pattern is wrong), and that the output of beam.combiners.ToList is therefore an empty list.

