How to join a PCollection on a static lookup table by key in streaming mode in Apache Beam (Python)

Question

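I'm streaming (unbounded) data from Google Cloud Pubsub into a PCollection in the form of dictionaries. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. The table is small enough to live in memory.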

I currently have a working solution that runs with the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.

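I read the bounded lookup table from a CSV file with beam.io.ReadFromText and parse each line into a dictionary. I then created a ParDo function that takes the unbounded PCollection as its main input and the lookup dictionaries as a side input. Inside the ParDo, a generator expression "joins" on the matching row of the lookup table, and the input element is enriched with that row's fields.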

Here are some of the main parts.


import apache_beam as beam

# Get bounded lookup table
lookup_dict = (pcoll | 'Read PS Table' >> beam.io.ReadFromText(...)
               | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(
    JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))

class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here: take the first lookup row whose join
        # field matches the element's, or None if there is no match
        lkup = next((row for row in lookup_data
                     if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)

        if lkup:
            # If there is a join, add new fields to the element
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
        yield element
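
For readers who want to check the join logic outside of a pipeline, here is a minimal pure-Python sketch of the two steps above: parsing a CSV line into a dict, and enriching an element from the lookup rows. The column names, the `id` join field, and the function names are hypothetical (the question does not show them or `SplitCSVtoDict`). The sketch also indexes the rows by key once instead of scanning the list for every element; inside Beam, a similar effect could probably be had by emitting `(key, row)` pairs and wrapping the side input in `beam.pvalue.AsDict` rather than `AsList`.

```python
import csv

# Hypothetical column names and join field; the question elides the real ones.
FIELDNAMES = ["id", "field1", "field2"]
JOIN_FIELD = "id"


def split_csv_to_dict(line, fieldnames=FIELDNAMES):
    """Parse one CSV line into a dict (what a SplitCSVtoDict DoFn might yield)."""
    values = next(csv.reader([line]))
    return dict(zip(fieldnames, values))


def index_by_key(rows, key=JOIN_FIELD):
    """Build {join value: row} once so each lookup is a single dict access."""
    return {row[key]: row for row in rows}


def enrich(element, lookup):
    """Return a copy of element with field1/field2 added when the key matches."""
    row = lookup.get(element[JOIN_FIELD])
    if row is None:
        return element  # no match: pass the element through unchanged
    enriched = dict(element)
    enriched["field1"] = row["field1"]
    enriched["field2"] = row["field2"]
    return enriched


rows = [split_csv_to_dict(line) for line in ["1,a,b", "2,c,d"]]
lookup = index_by_key(rows)
print(enrich({"id": "2", "payload": "x"}, lookup))
```

Returning a copy rather than mutating `element` is a design choice of this sketch, since Beam generally expects a DoFn not to modify its input elements.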

When running locally with the DirectRunner I get the correct result, but when running on the Dataflow runner I receive this error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: Expected custom source to have non-zero number of splits.

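This post, "Error while splitting pcollections on Dataflow runner", made me think that the cause of this error has to do with multiple workers not having access to the same lookup table when the work is split.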

Answer 1

In the future, please share the Beam version and the stack trace if you can.

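In this case, it is a known issue that the error message is not very good. At the time of writing, Dataflow for Python streaming supports only Pubsub for reading and writing and BigQuery for writing. Using the text source in such a pipeline results in this error.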
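
Because the table is small enough to fit in memory, one possible workaround (an assumption on my part, not something I have verified on Dataflow) is to avoid the text source altogether: load the CSV with plain Python when the pipeline is constructed and pass the resulting dict to the transform as an ordinary argument. The sketch below assumes the CSV has a header row and is readable from the machine that launches the job; `load_lookup`, `enrich`, `attach_enrichment`, and the `id` key are hypothetical names.

```python
import csv


def load_lookup(lines, key):
    """Index the rows of a small CSV (header row required) by the `key` column.

    `lines` is any iterable of text lines, for example an open file object.
    """
    return {row[key]: row for row in csv.DictReader(lines)}


def enrich(element, lookup, key):
    """Return a copy of element extended with the matching row's other columns."""
    row = lookup.get(element.get(key))
    if row is None:
        return element  # no match: pass the element through unchanged
    extra = {k: v for k, v in row.items() if k != key}
    return {**element, **extra}


def attach_enrichment(messages, lookup, key):
    """Apply the join to `messages`, the unbounded PCollection of dicts."""
    # Imported lazily so the helpers above can be tested without Beam installed.
    import apache_beam as beam

    # Extra keyword arguments to beam.Map are forwarded to `enrich`,
    # so the lookup dict is shipped to the workers with the function.
    return messages | "join pcoll on lkup" >> beam.Map(enrich, lookup=lookup, key=key)


# Usage at pipeline construction time (file name is hypothetical):
# with open("lookup.csv", newline="") as f:
#     lookup = load_lookup(f, "id")
# enriched = attach_enrichment(pubsub_messages, lookup, "id")
```

The trade-off is that the table is frozen when the job is launched, so a change to the CSV would require restarting the pipeline.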

Answer 1 posted 2019-09-05 04:44:24 (score: 0)