How to join a PCollection on a static lookup table by key in streaming mode in Apache Beam (Python)

Question

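I am streaming (unbounded) data from Google Cloud Pubsub into a PCollection in the form of dictionaries. As the streamed data comes in, I'd like to enrich it by joining it by key against a static (bounded) lookup table. The table is small enough to live in memory.

I currently have a solution that runs with the DirectRunner, but when I try to run it on the DataflowRunner I get an error.

I read the bounded lookup table from a csv with beam.io.ReadFromText and parse the values into dictionaries. I then created a ParDo function that takes the unbounded PCollection as its main input and the lookup dictionaries as a side input. Inside the ParDo, a generator expression "joins" on the matching row of the lookup table and enriches the input element.

Here are some of the main parts.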


import apache_beam as beam

class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here: take the first lookup row whose join field
        # matches the element's, or None when nothing matches
        lkup = next((row for row in lookup_data if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)

        if lkup:
            # If there is a join, add new fields to the element
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element

# Get bounded lookup table
lookup_dict = (pcoll | 'Read PS Table' >> beam.io.ReadFromText(...)
               | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
# The keyword must match the extra parameter name of process() (lookup_data)
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))

When running locally with the DirectRunner I get the correct result, but when running on the DataflowRunner I receive the following error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: Expected custom source to have non-zero number of splits.

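This post, "Error while splitting pcollections on Dataflow runner", made me think that the cause of this error has to do with multiple workers not having access to the same lookup table when the work is split.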
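For reference, the join step described in the question can be sketched in plain Python, independent of any runner. This is only a minimal sketch under assumptions: the helper names `index_by_key` and `enrich`, the `id` join field, and the sample rows are all hypothetical. It indexes the lookup rows by key once instead of scanning the list for every element, and it copies the element rather than mutating it. In Beam, a dict like this could presumably be supplied as a side input via `beam.pvalue.AsDict` over (key, row) pairs, but that wiring is not shown or tested here.

```python
def index_by_key(rows, key_field):
    """Build a dict mapping each row's join key to the row itself."""
    return {row[key_field]: row for row in rows}


def enrich(element, lookup_by_key, key_field, fields):
    """Return a copy of element with `fields` copied from its matching lookup row."""
    enriched = dict(element)  # copy so the input element is not mutated
    match = lookup_by_key.get(element.get(key_field))
    if match is not None:
        for field in fields:
            enriched[field] = match[field]
    return enriched


# Hypothetical sample data standing in for the parsed CSV rows
lookup_rows = [
    {"id": "a", "field1": "x1", "field2": "y1"},
    {"id": "b", "field1": "x2", "field2": "y2"},
]
lookup_by_key = index_by_key(lookup_rows, "id")

joined = enrich({"id": "b", "value": 10}, lookup_by_key, "id", ["field1", "field2"])
unmatched = enrich({"id": "z", "value": 5}, lookup_by_key, "id", ["field1", "field2"])
```

One difference from the generator version: if the lookup table contains duplicate keys, the dict keeps the last row for a key, whereas `next(...)` returns the first match.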

Answer 1

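In the future, please share the version of Beam and the stack trace if you can.

In this case, it is a known issue that the error message is not very helpful. At the time of this writing, Dataflow for Python streaming is limited to Pubsub for reading and writing and BigQuery for writing. Using the text source in a streaming pipeline results in this error.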
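Since the table is small enough to fit in memory, one possible way around the limitation (an assumption on my part, not something confirmed above) is to skip the text source entirely and parse the CSV with the standard library while constructing the pipeline. The resulting dict could then be handed to the DoFn as an ordinary extra argument of `beam.ParDo` instead of a side input. The function name `parse_lookup_csv`, the `id` column, and the sample CSV below are hypothetical.

```python
import csv
import io


def parse_lookup_csv(csv_text, key_field):
    """Parse CSV text with a header row into a dict keyed by `key_field`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}


# Hypothetical CSV content; in practice this would be the lookup file's text
SAMPLE_CSV = "id,field1,field2\na,x1,y1\nb,x2,y2\n"
lookup_table = parse_lookup_csv(SAMPLE_CSV, "id")
```

Note that every value comes back as a string, so the join field on the streamed elements would need to be a string as well, or be converted before the lookup.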

Answered 2019-09-05 04:44:24
