Can we break fusion between ParDos using windows + GroupBy or state & timers in an Apache Beam batch pipeline?

Question

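Context: I have N requests. For each one I need to place a fetch request (FetchData() ParDo), which in turn returns a result that I can use to download the data (DownloadData() ParDo). Right now these ParDos are being fused: a single worker places a fetch request, downloads the data, then places the next request, downloads that data, and so on.

So I want to parallelize these steps: as soon as the fetch step returns a result, the download should start, while the fetch step goes on to place the next request.

Attempt to break fusion: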

            request
            | 'Fetch' >> beam.ParDo(FetchData())
            | "GlobalWindow" >> beam.WindowInto(
                    window.GlobalWindows(),
                    trigger=trigger.Repeatedly(
                        trigger.AfterAny(
                            trigger.AfterProcessingTime(int(1.0 * 1)),
                            trigger.AfterCount(1)
                        )),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | 'GroupBy' >> beam.GroupBy()
            | 'Download' >> beam.ParDo(DownloadData())

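In effect, I want to break the fusion between the FetchData() and DownloadData() ParDos. My idea was to use GlobalWindows() with an early-firing trigger, then GroupBy() to group the elements of each pane and pass them on to the DownloadData() ParDo while the FetchData() ParDo keeps working in parallel.

But what I observe is that GroupBy() accumulates all elements before passing them on to the DownloadData() ParDo (it waits until every element has been processed by the preceding step).

Am I doing this right? Is there any way to make GroupBy() emit early? Or does anyone have another approach to achieve this?

Update:

Attempt 2, using state & timers to break fusion: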

                request
                | 'Fetch' >> beam.ParDo(FetchData())
                | "SetRequestKey" >> beam.ParDo(SetRequestKeyFn())
                | 'RequestBucket' >> beam.ParDo(RequestBucket())
                | 'Download' >> beam.ParDo(DownloadData())

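import apache_beam as beam
from apache_beam.coders import DillCoder
from apache_beam.transforms import userstate
from apache_beam.utils.timestamp import Duration, Timestamp


# Sets the request_id as the key.
class SetRequestKeyFn(beam.DoFn):
    def process(self, element):
        # Yield one (key, value) pair. Returning a bare tuple would make Beam
        # iterate over it and emit the key and the element as separate outputs.
        yield element[2]['href'], element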


class RequestBucket(beam.DoFn):
    """Stateful ParDo for storing requests."""
    REQUEST_STATE = userstate.BagStateSpec('requests', DillCoder())
    EXPIRY_TIMER = userstate.TimerSpec('expiry_timer', userstate.TimeDomain.REAL_TIME)

    def process(self,
                element,
                request_state=beam.DoFn.StateParam(REQUEST_STATE),
                timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):

        logger.info(f"Adding new state {element[0]}.")
        request_state.add(element)
        # Set a timer to go off 0 seconds in the future.
        timer.set(Timestamp.now() + Duration(seconds=0))

    @userstate.on_timer(EXPIRY_TIMER)
    def expiry_callback(self, request_state=beam.DoFn.StateParam(REQUEST_STATE)):
        """Reads and clears the buffered requests when the timer fires."""
        requests = list(request_state.read())
        request_state.clear()
        logger.info(f'Yielding for {requests!r}...')
        yield requests[0]

Here, the SetRequestKeyFn() ParDo waits until all elements have been processed by the preceding step before passing them on to the RequestBucket ParDo.

Answer 1

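In batch, all fusion barriers (including GroupByKey) are global barriers, i.e. everything upstream completes before anything downstream starts.

If the issue is that FetchData() has a high fan-out, one thing you could do is split out the cheap fan-out early and then add a reshuffle, i.e.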
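To make the difference concrete, here is a toy model in plain Python (not Beam code; `fetch`, `download`, and the log format are made up for illustration). It only shows the order in which work happens when two steps are fused versus when a global barrier sits between them, which matches what the question observes after inserting a GroupBy.

```python
def fetch(request, log):
    """Hypothetical stand-in for FetchData: records the call, returns a result."""
    log.append(f"fetch {request}")
    return f"result-{request}"


def download(result, log):
    """Hypothetical stand-in for DownloadData: records the call."""
    log.append(f"download {result}")


def run_fused(requests):
    """Fused stages: each element goes through both steps before the next one."""
    log = []
    for request in requests:
        download(fetch(request, log), log)
    return log


def run_with_barrier(requests):
    """Global barrier: every fetch finishes before any download starts."""
    log = []
    fetched = [fetch(request, log) for request in requests]  # materialize all
    for result in fetched:
        download(result, log)
    return log


if __name__ == "__main__":
    print(run_fused(["a", "b"]))
    print(run_with_barrier(["a", "b"]))
```

Neither ordering overlaps fetching with downloading on a single thread, which is why the suggestions below rely on more parallel work items or on threads instead.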

request
| ComputeFetchOperations()
| beam.Reshuffle()
| FetchOne()
| DownloadData()
...

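This will still fuse the FetchOne and DownloadData operations, but other fetches can be processed in parallel by other threads (or workers).

You could also make your DoFn multi-threaded, as described in "Async API calls in Apache Beam".

Another option is to try writing this as a streaming pipeline, though that may introduce additional complexity.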
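A minimal runnable sketch of that shape, under stated assumptions: the request layout (`id`, `parts`), the three helper functions, and the `example.invalid` URL are placeholders invented for illustration, not the asker's real FetchData/DownloadData logic. Only `beam.Create`, `beam.FlatMap`, `beam.Map`, and `beam.Reshuffle` are actual Beam transforms.

```python
def compute_fetch_operations(request):
    """Cheap fan-out: split one request into independent fetch operations."""
    for part in range(request["parts"]):
        yield (request["id"], part)


def fetch_one(operation):
    """Placeholder for a single fetch call; returns a fake download URL."""
    request_id, part = operation
    return f"https://example.invalid/{request_id}/{part}"


def download_data(url):
    """Placeholder for the download; returns the URL and a fake byte count."""
    return (url, len(url))


def run(requests):
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | "Requests" >> beam.Create(requests)
            | "ComputeFetchOperations" >> beam.FlatMap(compute_fetch_operations)
            # Breaks fusion here so the operations can be redistributed.
            | "Reshuffle" >> beam.Reshuffle()
            # These two steps still fuse, but run per operation in parallel.
            | "FetchOne" >> beam.Map(fetch_one)
            | "Download" >> beam.Map(download_data)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run([{"id": "r1", "parts": 2}, {"id": "r2", "parts": 3}])
```

The point of the design is that the expensive work happens after the reshuffle, so the runner has many small work items to spread across threads or workers rather than one fused chain per original request.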
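As a rough illustration of the multi-threaded idea (a sketch, not the code from the referenced post), the blocking calls for a batch of elements can be handed to a thread pool from the standard library. `download_one` is a hypothetical placeholder for the real blocking download, and the default of 8 workers is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def download_one(result):
    """Hypothetical placeholder for one blocking download call."""
    return f"data-for-{result}"


def download_batch(results, max_workers=8):
    """Downloads a batch of fetch results concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(download_one, results))


if __name__ == "__main__":
    print(download_batch(["a", "b", "c"], max_workers=2))
```

Inside a pipeline, one way to feed such a function would be to batch elements first, for example with `beam.BatchElements`, and then apply `beam.FlatMap(download_batch)`, so that each bundle overlaps several downloads even though the steps remain fused.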

Solution 1
0 2022-12-15 21:58:28
