Apache Beam Python: using multiple shared handles in a single pipeline

Question

I have built a pipeline in which I am trying to share two different objects across workers using the apache_beam.utils.shared module. The pipeline needs a different shared object in each of two separate stages: the first stage uses one shared object, and another stage needs a different one. I have created a test pipeline to illustrate my case:

import apache_beam as beam
from apache_beam.utils import shared

input_data = [
    {"id": 1, "group": "A", "val": 20},  # y
    {"id": 2, "group": "A", "val": 30},  # y
    {"id": 3, "group": "A", "val": 10},  # y
    {"id": 4, "group": "A", "val": 10},  # n
    {"id": 5, "group": "B", "val": 40},  # y
    {"id": 6, "group": "B", "val": 50},  # n
    {"id": 7, "group": "B", "val": 70},  # y
    {"id": 8, "group": "B", "val": 80},  # n
    {"id": 9, "group": "C", "val": 20},  # y
    {"id": 10, "group": "C", "val": 5},  # n
]


class WeakRefDict(dict):
    pass

class WeakRefSet(set):
    pass


class OutlierDetector(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        def construct_filter():
            # In reality this would be a much slower operation as it will read from database
            print("CALLED MAP!")
            filter_map = {"A": 25, "B": 60, "C": 30}
            return WeakRefDict(filter_map)

        filter_m = self._shared_handle.acquire(construct_filter)
        threshold = filter_m.get(element['group'], 0)
        is_outlier = False
        if element['val'] > threshold:
            is_outlier = True
        element['is_outlier'] = is_outlier
        yield element


class SimpleFilter(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        def construct_filter():
            # In reality this would be a much slower operation as it will read from database
            print("CALLED FILTER!")
            filter_set = {1, 2, 3, 5, 7, 9}
            # filter_set = {}
            return WeakRefSet(filter_set)

        filter_m = self._shared_handle.acquire(construct_filter)
        if element['id'] in filter_m:
            pass
        else:
            yield element


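find_outliers = True  # Branch A is optional; set to False to skip it

shared_handle = shared.Shared()
# shared_handle_2 = shared.Shared()  # second handle, used when testing separate handles

with beam.Pipeline() as pipeline:
    data = pipeline | "Generate some data" >> beam.Create(input_data)

    if find_outliers:
        # Branch A (swap in shared_handle_2 here to test two separate handles)
        step1a = data | 'Map to filters' >> beam.ParDo(OutlierDetector(shared_handle))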
        step1a | "Print A" >> beam.ParDo(print)

    # Branch B
    step1b = data | 'Simple filters' >> beam.ParDo(SimpleFilter(shared_handle))
    step2b = step1b | "Map to key val" >> beam.Map(lambda x: (x['group'], x['val']))
    step3b = step2b | "Sum by group" >> beam.CombinePerKey(sum)
    step3b | "Print B" >> beam.ParDo(print)

However, the problem is the following: if I use the same shared handle, I am not able to acquire different objects; I always seem to receive the same object. I get an error like the following:

AttributeError: 'WeakRefSet' object has no attribute 'get' [while running 'Map to filters']

This happens because the call self._shared_handle.acquire(construct_filter) returns a set rather than a dictionary in the OutlierDetector DoFn.
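The effect can be reproduced without a pipeline. The sketch below uses a hypothetical, heavily simplified stand-in for a shared handle (TinyShared is not Beam's implementation; it only assumes that a handle caches one weakly referenced object and ignores the constructor while that object is alive):

```python
import weakref


class WeakRefDict(dict):
    pass


class WeakRefSet(set):
    pass


class TinyShared:
    """Hypothetical stand-in for a shared handle; not Beam's actual code."""

    def __init__(self):
        self._ref = None  # weak reference to the cached object, if any

    def acquire(self, constructor_fn):
        obj = self._ref() if self._ref is not None else None
        if obj is None:
            # Nothing cached (or it was garbage collected): build it now.
            obj = constructor_fn()
            self._ref = weakref.ref(obj)
        return obj


handle = TinyShared()

# The first caller builds a set and keeps a reference to it.
filter_set = handle.acquire(lambda: WeakRefSet({1, 2, 3, 5, 7, 9}))

# The second caller asks for a dict, but the cached set is still alive,
# so the constructor is never called and the set is returned instead.
filter_map = handle.acquire(lambda: WeakRefDict({"A": 25, "B": 60, "C": 30}))

print(type(filter_map).__name__)
print(hasattr(filter_map, "get"))
```

Under this assumption, whichever constructor runs first determines the type that every later acquire call on the same handle receives, which matches the AttributeError above.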

If I instead use two separate shared handles, my workers do not share the objects, and the code calls the construct_filter() function for every element. In other words, I get the following output:

CALLED MAP!
{'id': 1, 'group': 'A', 'val': 20, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 2, 'group': 'A', 'val': 30, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 3, 'group': 'A', 'val': 10, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 4, 'group': 'A', 'val': 10, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 5, 'group': 'B', 'val': 40, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 6, 'group': 'B', 'val': 50, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 7, 'group': 'B', 'val': 70, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 8, 'group': 'B', 'val': 80, 'is_outlier': True}
CALLED FILTER!
CALLED MAP!
{'id': 9, 'group': 'C', 'val': 20, 'is_outlier': False}
CALLED FILTER!
CALLED MAP!
{'id': 10, 'group': 'C', 'val': 5, 'is_outlier': False}
('A', 10)
('B', 130)
('C', 5)

What would be the best way to share two separate objects in two separate stages of the pipeline? A workaround would be to load everything and share all objects through one shared handle, but I find this inefficient because I would have to copy a lot of unused data across multiple workers, especially if (as in my case) some steps are optional.

Answer 1

I think both DoFns are fused into the same stage.

According to the documentation, only one shared token is available in a stage. Hence, it is overwritten whenever acquire is called.

You can try the following approach (not tested):

1. Create two different shared handles.
2. Prevent fusion by reshuffling.
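To make the "overwritten whenever acquire is called" behavior concrete, here is a minimal model. It assumes that the shared module holds each object only through a weak reference, plus one strong "keepalive" slot for the most recently acquired object. KeepAliveMap and the handle names are hypothetical, and whether this matches your Beam version should be checked against its source. The example relies on CPython freeing objects as soon as their last reference disappears.

```python
import weakref


class WeakRefDict(dict):
    pass


class WeakRefSet(set):
    pass


class KeepAliveMap:
    """Hypothetical model: weak refs per handle, one strong keepalive slot."""

    def __init__(self):
        self._refs = {}          # handle key -> weak reference
        self._keepalive = None   # strong ref to the most recent object only

    def acquire(self, key, constructor_fn):
        ref = self._refs.get(key)
        obj = ref() if ref is not None else None
        if obj is None:
            obj = constructor_fn()
            self._refs[key] = weakref.ref(obj)
        self._keepalive = obj  # replaces whatever was kept alive before
        return obj


calls = []


def build_map():
    calls.append("map")
    return WeakRefDict({"A": 25, "B": 60, "C": 30})


def build_set():
    calls.append("set")
    return WeakRefSet({1, 2, 3, 5, 7, 9})


# Case 1: callers drop the result right away, as process() does above.
shared_map = KeepAliveMap()
for _ in range(3):
    shared_map.acquire("handle_1", build_map)
    shared_map.acquire("handle_2", build_set)
rebuilds_without_refs = len(calls)  # every acquire rebuilds its object

# Case 2: callers hold on to the objects they acquired.
calls.clear()
pinned = KeepAliveMap()
held_map = pinned.acquire("handle_1", build_map)
held_set = pinned.acquire("handle_2", build_set)
for _ in range(3):
    pinned.acquire("handle_1", build_map)
    pinned.acquire("handle_2", build_set)
rebuilds_with_refs = len(calls)  # only the two initial builds

print(rebuilds_without_refs, rebuilds_with_refs)
```

If the assumption holds, alternating acquire calls on two handles evict each other's object, which would explain the repeated CALLED MAP! / CALLED FILTER! output in the question. One option that follows from the model, untested against the pipeline above, is to call acquire once in the DoFn's setup() method and store the result on self, so that each DoFn instance keeps its own strong reference.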

Answer 2

At @Dakshin Rajavel's request, I am posting a more detailed answer.

import apache_beam as beam
from apache_beam.utils import shared

input_data = [
    {"id": 1, "group": "A", "val": 20},  # y
    {"id": 2, "group": "A", "val": 30},  # y
    {"id": 3, "group": "A", "val": 10},  # y
    {"id": 4, "group": "A", "val": 10},  # n
    {"id": 5, "group": "B", "val": 40},  # y
    {"id": 6, "group": "B", "val": 50},  # n
    {"id": 7, "group": "B", "val": 70},  # y
    {"id": 8, "group": "B", "val": 80},  # n
    {"id": 9, "group": "C", "val": 20},  # y
    {"id": 10, "group": "C", "val": 5},  # n
]


class WeakRef:
    def __init__(self, weak_ref_dict: dict, weak_ref_set: set):
        self.weak_ref_dict = weak_ref_dict
        self.weak_ref_set = weak_ref_set


def construct_filter():
    # In reality this would be a much slower operation as it will read from database
    print("CALLED GLOBAL MAPPER!")
    filter_map = {"A": 25, "B": 60, "C": 30}

    filter_set = {1, 2, 3, 5, 7, 9}
    # filter_set = {}

    return WeakRef(weak_ref_dict=filter_map, weak_ref_set=filter_set)


class OutlierDetector(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        filter_m = self._shared_handle.acquire(construct_filter).weak_ref_dict
        threshold = filter_m.get(element['group'], 0)
        is_outlier = False
        if element['val'] > threshold:
            is_outlier = True
        element['is_outlier'] = is_outlier
        yield element


class SimpleFilter(beam.DoFn):

    def __init__(self, shared_handle):
        super().__init__()
        self._shared_handle = shared_handle

    def process(self, element):
        filter_m = self._shared_handle.acquire(construct_filter).weak_ref_set
        if element['id'] in filter_m:
            pass
        else:
            yield element


shared_handle = shared.Shared()

with beam.Pipeline() as pipeline:
    data = pipeline | "Generate some data" >> beam.Create(input_data)

    # Branch A
    step1a = data | 'Map to filters' >> beam.ParDo(OutlierDetector(shared_handle))
    step1a | "Print A" >> beam.ParDo(print)

    # Branch B
    step1b = data | 'Simple filters' >> beam.ParDo(SimpleFilter(shared_handle))
    step2b = step1b | "Map to key val" >> beam.Map(lambda x: (x['group'], x['val']))
    step3b = step2b | "Sum by group" >> beam.CombinePerKey(sum)
    step3b | "Print B" >> beam.ParDo(print)

Basically, the solution I ended up with creates a single shared object that loads and stores both collections (which were previously stored under two separate handles). This way, I pick the attribute I need depending on which DoFn the program is currently in.
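A possible refinement, given the concern in the question about loading data for optional steps: the single shared object could load each part lazily, on first access. The sketch below is only an illustration with hypothetical names (LazyFilters, thresholds, kept_ids, load_log); it uses functools.cached_property and does not address thread safety.

```python
from functools import cached_property

load_log = []  # records which parts were actually loaded


class LazyFilters:
    """Hypothetical container: each part is built on first access only."""

    @cached_property
    def thresholds(self):
        # Stand-in for a slow database read.
        load_log.append("thresholds")
        return {"A": 25, "B": 60, "C": 30}

    @cached_property
    def kept_ids(self):
        # Stand-in for a second slow database read.
        load_log.append("kept_ids")
        return {1, 2, 3, 5, 7, 9}


# This is what a constructor passed to acquire() could return.
filters = LazyFilters()

# Only the outlier branch runs here, so only the thresholds are loaded,
# and the second lookup reuses the cached dictionary.
is_outlier = 70 > filters.thresholds.get("B", 0)
is_outlier_again = 20 > filters.thresholds.get("A", 0)

print(load_log)
```

With this shape, construct_filter() would simply return LazyFilters(), and a pipeline that skips one branch never pays for that branch's data.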


solution1
0 2022-12-08 13:28:32

solution2
0 2023-01-30 17:02:48