

How to join a PCollection in streaming mode on a static lookup table by key in Apache Beam (Python)

I'm streaming in (unbounded) data from Google Cloud Pubsub into a PCollection in the form of a dictionary. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. This table is small enough to live in memory.

I currently have a working solution that runs using the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.

I've read the bounded lookup table in from a CSV using the beam.io.ReadFromText function and parsed the values into a dictionary. I've then created a ParDo function that takes my unbounded PCollection and the lookup dictionary as a side input. In the ParDo, it uses a generator to "join" on the correct row of the lookup table and enriches the input element.

Here are some of the main parts:


import apache_beam as beam

# Get bounded lookup table
lookup_dict = (pcoll | 'Read PS Table' >> beam.io.ReadFromText(...)
                     | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))

class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # Use a generator to find the matching row of the lookup table
        lkup = next((row for row in lookup_data if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)

        if lkup:
            # If there is a join, add the new fields to the element
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element
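
For completeness, SplitCSVtoDict isn't shown above; a minimal sketch of what such a DoFn might look like is below (the column names and CSV layout are assumptions, not from my actual table):

import csv
import apache_beam as beam

class SplitCSVtoDict(beam.DoFn):
    # Assumed column names; replace with the real lookup table's schema.
    FIELDS = [<JOIN_FIELD>, 'field1', 'field2']

    def process(self, line):
        # Parse one CSV line and emit it as a dict keyed by column name
        values = next(csv.reader([line]))
        yield dict(zip(self.FIELDS, values))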

I was able to get the correct result when running locally using the DirectRunner, but when running on the DataflowRunner, I receive this error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: Expected custom source to have non-zero number of splits.

This post: " Error while splitting pcollections on Dataflow runner " made me think that the reason for this error has to do with the multiple workers not having access to the same lookup table when splitting the work. 这篇文章:“ 在数据流运行器上拆分集合时出错 ”使我认为,此错误的原因与拆分工作时多个工作人员无法访问同一查找表有关。

In the future, please share the version of Beam and the stack trace if you can. 将来,请尽可能共享Beam的版本和堆栈跟踪。

In this case, it is a known issue that the error message is not very good. At the time of this writing, Dataflow for Python streaming is limited to Pub/Sub for reading and writing, and BigQuery for writing. Using the text source in a pipeline results in this error.
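
One way to avoid the text source in a streaming pipeline is to read the small CSV on the machine that constructs the pipeline and bake the rows into the DoFn, so they are pickled and shipped to every worker. The following is only a sketch (the file path, join field name, and pipeline wiring are assumptions), not the answer's prescribed approach:

import csv
import apache_beam as beam

class JoinLkupData(beam.DoFn):
    def __init__(self, lookup_rows, join_field):
        # lookup_rows is a plain list of dicts; it is serialized with the DoFn,
        # so no text source is needed inside the streaming pipeline.
        self._lookup = {row[join_field]: row for row in lookup_rows}
        self._join_field = join_field

    def process(self, element):
        lkup = self._lookup.get(element[self._join_field])
        if lkup:
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element

# Read the lookup CSV at pipeline-construction time (hypothetical path).
with open('lookup.csv') as f:
    lookup_rows = list(csv.DictReader(f))

# Then, inside the streaming pipeline:
# enriched = messages | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData(lookup_rows, 'id'))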
