
How to join PCollection in streaming mode on a static lookup table by key in Apache Beam (Python)

I'm streaming in (unbounded) data from Google Cloud Pubsub into a PCollection in the form of a dictionary. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. This table is small enough to live in memory.
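For reference, the unbounded PCollection is created along these lines (a simplified sketch; the project/topic names and the JSON parsing step are assumptions, not taken verbatim from my pipeline):

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    # Each Pubsub message carries a JSON payload; parse it into a dict.
    pcoll = (p
             | 'Read from Pubsub' >> beam.io.ReadFromPubSub(
                 topic='projects/<PROJECT>/topics/<TOPIC>')
             | 'Parse JSON' >> beam.Map(json.loads))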

I currently have a working solution that runs using the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.

I've read the bounded lookup table in from a CSV using the beam.io.ReadFromText function and parsed the values into a dictionary. I then created a ParDo function that takes my unbounded PCollection and the lookup dictionary as a side input. In the ParDo, it uses a generator to "join" on the correct row of the lookup table and enrich the input element.

Here are some of the main parts:


# Get bounded lookup table
lookup_dict = (pcoll
               | 'Read PS Table' >> beam.io.ReadFromText(...)
               | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))

class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here
        lkup = next((row for row in lookup_data if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)

        if lkup:
            # If there is a join, add new fields to the pcoll
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element
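As an aside, here is a sketch of a variant (not the code I'm actually running; the join field and field names are placeholders): if the lookup rows are first keyed by the join field, beam.pvalue.AsDict turns the side input into a plain dict, so each element needs only a single lookup instead of a scan over every row.

keyed_lookup = (lookup_dict
                | 'Key by join field' >> beam.Map(lambda row: (row['join_key'], row)))

class JoinLkupDataDict(beam.DoFn):
    def process(self, element, lookup_data):
        # lookup_data is a dict mapping join_key -> lookup row
        lkup = lookup_data.get(element['join_key'])
        if lkup:
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element

enriched = pcoll | 'join pcoll on lkup dict' >> beam.ParDo(
    JoinLkupDataDict(), lookup_data=beam.pvalue.AsDict(keyed_lookup))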

I was able to get the correct result when running locally using the DirectRunner, but when running on the DataflowRunner, I receive this error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: Expected custom source to have non-zero number of splits.

This post: " Error while splitting pcollections on Dataflow runner " made me think that the reason for this error has to do with the multiple workers not having access to the same lookup table when splitting the work.

In the future, please share the version of Beam and the stack trace if you can.

In this case, it is a known issue that the error message is not very good. At the time of this writing, Dataflow for Python streaming is limited to only Pubsub for reading and writing, and BigQuery for writing. Using the text source in a streaming pipeline results in this error.
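Since the lookup table is small, one possible workaround (a rough, untested sketch; the GCS path, join field, and field names are placeholders) is to skip the text source entirely and load the CSV into memory inside the DoFn, so that the only source in the streaming pipeline is Pubsub:

import csv

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class EnrichFromCsv(beam.DoFn):
    def __init__(self, lookup_path):
        self._lookup_path = lookup_path
        self._lookup = None

    def start_bundle(self):
        # Load and cache the small lookup table on the worker, keyed by the join field.
        if self._lookup is None:
            with FileSystems.open(self._lookup_path) as f:
                lines = f.read().decode('utf-8').splitlines()
            self._lookup = {row['join_key']: row for row in csv.DictReader(lines)}

    def process(self, element):
        lkup = self._lookup.get(element['join_key'])
        if lkup:
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element


enriched = pcoll | 'enrich from csv lkup' >> beam.ParDo(
    EnrichFromCsv('gs://<BUCKET>/lookup.csv'))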
