Apache Beam Python: using multiple shared handles in one pipeline

Question

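I have built a pipeline in which I am trying to share two different objects across workers using the apache_beam.utils.shared module. The pipeline needs a different shared object at two separate stages: the first stage uses one shared object, and a later stage needs another. I created a test pipeline to illustrate my case: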

import apache_beam as beam
from apache_beam.utils import shared

input_data = [
    {"id": 1, "group": "A", "val": 20},  # y
    {"id": 2, "group": "A", "val": 30},  # y
    {"id": 3, "group": "A", "val": 10},  # y
    {"id": 4, "group": "A", "val": 10},  # n
    {"id": 5, "group": "B", "val": 40},  # y
    {"id": 6, "group": "B", "val": 50},  # n
    {"id": 7, "group": "B", "val": 70},  # y
    {"id": 8, "group": "B", "val": 80},  # n
    {"id": 9, "group": "C", "val": 20},  # y
    {"id": 10, "group": "C", "val": 5},  # n
]


class WeakRefDict(dict):
    pass

class WeakRefSet(set):
    pass


class OutlierDetector(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        def construct_filter():
            # In reality this would be a much slower operation as it will read from database
            print("CALLED MAP!")
            filter_map = {"A": 25, "B": 60, "C": 30}
            return WeakRefDict(filter_map)

        filter_m = self._shared_handle.acquire(construct_filter)
        threshold = filter_m.get(element['group'], 0)
        is_outlier = False
        if element['val'] > threshold:
            is_outlier = True
        element['is_outlier'] = is_outlier
        yield element


class SimpleFilter(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        def construct_filter():
            # In reality this would be a much slower operation as it will read from database
            print("CALLED FILTER!")
            filter_set = {1, 2, 3, 5, 7, 9}
            # filter_set = {}
            return WeakRefSet(filter_set)

        filter_m = self._shared_handle.acquire(construct_filter)
        if element['id'] in filter_m:
            pass
        else:
            yield element


shared_handle = shared.Shared()
# shared_handle_2 = shared.Shared()

# Toggle for the optional outlier-detection branch.
find_outliers = True

with beam.Pipeline() as pipeline:
    data = pipeline | "Generate some data" >> beam.Create(input_data)

    if find_outliers:
        # Branch A
        step1a = data | 'Map to filters' >> beam.ParDo(OutlierDetector(shared_handle))
        step1a | "Print A" >> beam.ParDo(print)

    # Branch B
    step1b = data | 'Simple filters' >> beam.ParDo(SimpleFilter(shared_handle))
    step2b = step1b | "Map to key val" >> beam.Map(lambda x: (x['group'], x['val']))
    step3b = step2b | "Sum by group" >> beam.CombinePerKey(sum)
    step3b | "Print B" >> beam.ParDo(print)

The problem is this: if I use the same shared handle, I cannot obtain different objects; I always receive the same object. I get an error like the following:

AttributeError: 'WeakRefSet' object has no attribute 'get' [while running 'Map to filters']

This happens because calling self._shared_handle.acquire(construct_filter) in the OutlierDetector DoFn returns a set instead of a dictionary.

If I instead use two separate shared handles, the workers do not share the object; the code calls construct_filter() every time. In other words, I get the following output:

CALLED MAP!
{'id': 1, 'group': 'A', 'val': 20, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 2, 'group': 'A', 'val': 30, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 3, 'group': 'A', 'val': 10, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 4, 'group': 'A', 'val': 10, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 5, 'group': 'B', 'val': 40, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 6, 'group': 'B', 'val': 50, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 7, 'group': 'B', 'val': 70, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 8, 'group': 'B', 'val': 80, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 9, 'group': 'C', 'val': 20, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 10, 'group': 'C', 'val': 5, 'is_outlier': False}
('A', 10)
('B', 130)
('C', 5)
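One possible reading of this output, offered as an assumption rather than something established in this thread: Shared tracks its cached objects through weak references (which is why the WeakRefDict and WeakRefSet subclasses are needed), so an object that nothing else holds on to can be collected and then rebuilt on the next acquire. The sketch below uses only the standard library's weakref module to show that mechanism; it does not reproduce Beam's internals.

```python
import gc
import weakref


class WeakRefDict(dict):
    """Subclassing makes the dict weak-referenceable; a plain dict is not."""


# A built-in dict cannot be the target of a weak reference.
try:
    weakref.ref({"A": 25})
    plain_dict_ok = True
except TypeError:
    plain_dict_ok = False

thresholds = WeakRefDict({"A": 25, "B": 60, "C": 30})
ref = weakref.ref(thresholds)
alive_while_referenced = ref() is thresholds

# Drop the only strong reference; the weak reference does not keep the
# object alive, so it can be collected.
del thresholds
gc.collect()
alive_after_release = ref() is not None

print(plain_dict_ok, alive_while_referenced, alive_after_release)
```

If this reading applies, keeping the acquired object in an attribute of the DoFn, instead of in a local variable inside process, would be one way to keep it alive between elements.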

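What is the best way to share two separate objects across two separate stages of a pipeline? One workaround is to load everything and share all objects through a single shared handle, but I find this inefficient because I would have to copy a lot of unused data across multiple workers, especially if (as in my case) some steps are optional.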

Answer 1

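I think the two DoFns are being fused into the same stage.

According to the documentation, only one shared token is available per stage, so it is overwritten each time acquire is called.

You could try the following (untested):

Create two different shared handles.
Prevent fusion by adding a Reshuffle.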

Answer 2

At @Dakshin Rajavel's request, I am posting a more detailed answer.

import apache_beam as beam
from apache_beam.utils import shared

input_data = [
    {"id": 1, "group": "A", "val": 20},  # y
    {"id": 2, "group": "A", "val": 30},  # y
    {"id": 3, "group": "A", "val": 10},  # y
    {"id": 4, "group": "A", "val": 10},  # n
    {"id": 5, "group": "B", "val": 40},  # y
    {"id": 6, "group": "B", "val": 50},  # n
    {"id": 7, "group": "B", "val": 70},  # y
    {"id": 8, "group": "B", "val": 80},  # n
    {"id": 9, "group": "C", "val": 20},  # y
    {"id": 10, "group": "C", "val": 5},  # n
]


class WeakRef:
    def __init__(self, weak_ref_dict: dict, weak_ref_set: set):
        self.weak_ref_dict = weak_ref_dict
        self.weak_ref_set = weak_ref_set


def construct_filter():
    # In reality this would be a much slower operation as it will read from database
    print("CALLED GLOBAL MAPPER!")
    filter_map = {"A": 25, "B": 60, "C": 30}

    filter_set = {1, 2, 3, 5, 7, 9}
    # filter_set = {}

    return WeakRef(weak_ref_dict=filter_map, weak_ref_set=filter_set)


class OutlierDetector(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        filter_m = self._shared_handle.acquire(construct_filter).weak_ref_dict
        threshold = filter_m.get(element['group'], 0)
        is_outlier = False
        if element['val'] > threshold:
            is_outlier = True
        element['is_outlier'] = is_outlier
        yield element


class SimpleFilter(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        filter_m = self._shared_handle.acquire(construct_filter).weak_ref_set
        if element['id'] in filter_m:
            pass
        else:
            yield element


shared_handle = shared.Shared()

with beam.Pipeline() as pipeline:
    data = pipeline | "Generate some data" >> beam.Create(input_data)

    # Branch A
    step1a = data | 'Map to filters' >> beam.ParDo(OutlierDetector(shared_handle))
    step1a | "Print A" >> beam.ParDo(print)

    # Branch B
    step1b = data | 'Simple filters' >> beam.ParDo(SimpleFilter(shared_handle))
    step2b = step1b | "Map to key val" >> beam.Map(lambda x: (x['group'], x['val']))
    step3b = step2b | "Sum by group" >> beam.CombinePerKey(sum)
    step3b | "Print B" >> beam.ParDo(print)

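Basically, the solution I ended up with creates a single shared object that loads and stores both objects (previously stored in two separate handles). I then pick the one I need depending on which function the program is in.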
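A possible variation on this approach, not part of the answer itself, is to make each part of the combined object load lazily, so that a pipeline which skips an optional branch never loads that branch's data. The sketch below assumes a hypothetical LazyFilters class built on the standard library's functools.cached_property; load_log exists only to show which loaders ran.

```python
from functools import cached_property


class LazyFilters:
    """Container whose parts are loaded on first access and then cached."""

    def __init__(self):
        self.load_log = []  # records which loaders actually ran

    @cached_property
    def thresholds(self):
        # Stand-in for a slow database read.
        self.load_log.append("thresholds")
        return {"A": 25, "B": 60, "C": 30}

    @cached_property
    def excluded_ids(self):
        # Stand-in for a slow database read.
        self.load_log.append("excluded_ids")
        return {1, 2, 3, 5, 7, 9}


filters = LazyFilters()

# A pipeline that only runs the filtering branch touches one attribute,
# so only that loader runs.
print(4 in filters.excluded_ids)
print(filters.load_log)
```

Because LazyFilters is an ordinary class, its instances are weak-referenceable, so the class itself could be passed as the constructor, for example `shared_handle.acquire(LazyFilters).excluded_ids`. Depending on the Python version, cached_property may run a loader more than once if several threads hit the first access at the same time.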

Please don't forget to upvote.

Answer 1
0 2022-12-08 13:28:32

Answer 2
0 2023-01-30 17:02:48
