
Can we break fusion between ParDos using windows + GroupBy, or state & timers, in a batch Apache Beam pipeline?

Context: I have N requests for which I need to place a fetch request (FetchData() ParDo), which in turn returns a result I can use to download the data (DownloadData() ParDo). These ParDos are getting fused, so a single worker places a fetch request, downloads the data, then places the next request and downloads again, and so on, serially.

I want to parallelize these steps so that data starts downloading as soon as the fetch step returns a result, while the fetch step places the next request in parallel with the ongoing download.

Attempt to break the fusion:

            # needs: from apache_beam.transforms import window, trigger
            request
            | 'Fetch' >> beam.ParDo(FetchData())
            | 'GlobalWindow' >> beam.WindowInto(
                    window.GlobalWindows(),
                    trigger=trigger.Repeatedly(
                        trigger.AfterAny(
                            trigger.AfterProcessingTime(1),
                            trigger.AfterCount(1)
                        )),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | 'GroupBy' >> beam.GroupBy()
            | 'Download' >> beam.ParDo(DownloadData())

I actually want to break the fusion between the FetchData() and DownloadData() ParDos, so I thought of this approach: apply GlobalWindows() with a trigger, then use GroupBy() to group each pane's elements and send them on to the DownloadData() ParDo while the FetchData() ParDo keeps working in parallel.

But what I'm observing is that GroupBy() accumulates all the elements (it waits for every upstream element to be processed) before sending anything on to the DownloadData() ParDo.

Am I doing the right thing? Is there any way to make GroupBy() emit early? Or does anyone have another approach to achieve my goal?

Update:

Attempt 2 to break the fusion, using state & timers:

                request
                | 'Fetch' >> beam.ParDo(FetchData())
                | 'SetRequestKey' >> beam.ParDo(SetRequestKeyFn())
                | 'RequestBucket' >> beam.ParDo(RequestBucket())
                | 'Download' >> beam.ParDo(DownloadData())

# Keys each element by its request id (the 'href' field)
class SetRequestKeyFn(beam.DoFn):
    def process(self, element):
        # yield a single (key, value) pair; a bare `return key, value`
        # would be iterated as two separate output elements
        yield element[2]['href'], element


# needs (roughly): from apache_beam.transforms import userstate
#                  from apache_beam.utils.timestamp import Timestamp, Duration
class RequestBucket(beam.DoFn):
    """Stateful ParDo for storing requests."""
    REQUEST_STATE = userstate.BagStateSpec('requests', DillCoder())
    EXPIRY_TIMER = userstate.TimerSpec('expiry_timer', userstate.TimeDomain.REAL_TIME)

    def process(self,
                element,
                request_state=beam.DoFn.StateParam(REQUEST_STATE),
                timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):

        logger.info(f"Adding new state {element[0]}.")
        request_state.add(element)
        # Set a timer to go off 0 seconds in the future.
        timer.set(Timestamp.now() + Duration(seconds=0))

    @userstate.on_timer(EXPIRY_TIMER)
    def expiry_callback(self, request_state=beam.DoFn.StateParam(REQUEST_STATE)):
        """"""
        requests = list(request_state.read())
        request_state.clear()
        logger.info(f'Yielding for {requests!r}...')
        yield requests[0]

Here too, the SetRequestKeyFn() ParDo waits for all upstream elements to be processed before sending anything on to the RequestBucket ParDo.

In batch, all fusion barriers (including GroupByKey) are global barriers, i.e. everything upstream is completed before anything downstream starts.

If the issue is that FetchData() has high fanout, one thing you could do is split out the cheap fan-out ahead of time and then add a reshuffle, i.e.

request
| ComputeFetchOperations()
| beam.Reshuffle()
| FetchOne()
| DownloadData()
...

This would still fuse the FetchOne and DownloadData operations, but other fetches could be handled by other threads (or workers) in parallel.

You could also look into writing a multi-threaded DoFn, as described in "Asynchronous API calls in Apache Beam".
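The multi-threaded idea can be sketched in plain Python, without Beam: submit each download to a thread pool as soon as its fetch result arrives, so downloads overlap with subsequent fetches. `fetch` and `download` are hypothetical placeholders for the real API calls:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(request):
    """Placeholder for the fetch API call."""
    return f'result-for-{request}'


def download(result):
    """Placeholder for the actual (slow) download."""
    return f'data-from-{result}'


def fetch_and_download_all(requests, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for request in requests:
            result = fetch(request)                # fetch in the caller thread...
            futures.append(pool.submit(download, result))  # ...download in the pool
        # collect results in submission order
        return [f.result() for f in futures]
```

Inside a DoFn the same pattern applies: keep the pool on the DoFn instance (e.g. created in `setup()`) and yield downloaded results as futures complete.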

Another option could be to try and write this as a streaming pipeline instead, though that may introduce additional complexities.
