
Apache Beam Side Inputs not working in Streaming Dataflow Pipeline with Python SDK

I am working on a larger Dataflow pipeline which works perfectly in batch mode, but after refactoring it for streaming it has problems with side inputs. If I put the pipeline into streaming mode and remove the side input, it runs perfectly on Google's Dataflow.

I stripped everything down and built the following short script to encapsulate the problem and be able to play around with it.

import apache_beam as beam
import os
import json
import sys
import logging

""" Here the switches """
run_local = False
streaming = False

project = 'google-project-name'
bucket = 'dataflow_bucket'
tmp_location = 'gs://{}/{}/'.format(bucket, "tmp")
topic = "projects/{}/topics/dataflowtopic".format(project)

credentials_file = os.path.join(os.path.dirname(__file__), "credentials.json")
if os.path.exists(credentials_file):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials_file

if not run_local:
    runner = "DataflowRunner"
else:
    runner = "DirectRunner"

argv = [
    '--region={0}'.format("europe-west6"),
    '--project={0}'.format(project),
    '--temp_location={0}'.format(tmp_location),
    '--runner={0}'.format(runner),
    '--save_main_session',
    '--max_num_workers=2'
]
if streaming:
    argv.append('--streaming')

def filter_existing(item, existing=None, id_field: str = 'id'):
    sys.stdout.write(".")
    return item.get(id_field) not in existing


class UserProcessor(beam.DoFn):
    def process(self, user, **kwargs):
        if run_local:
            print("processing a user and getting items for it")
        else:
            logging.info("processing a user and getting items for it")
        for i in range(10):
            yield {"id": i, "item_name": "dingdong", "user": user}


class ExistingProcessor(beam.DoFn):
    def process(self, user, **kwargs):
        if run_local:
            print("creating the list to exclude items from the result")
        else:
            logging.info("creating the list to exclude items from the result")
        yield {"id": 3}
        yield {"id": 5}
        yield {"id": 9}


with beam.Pipeline(argv=argv) as p:
    if streaming:
        if run_local:
            print("Waiting for Pub/Sub Message on Topic: {}".format(topic))
        else:
            logging.info("Waiting for Pub/Sub Message on Topic: {}".format(topic))
        users = p | "Loading user" >> beam.io.ReadFromPubSub(topic=topic) | beam.Map(lambda x: json.loads(x.decode()))
    else:
        if run_local:
            print("Loading Demo User")
        else:
            logging.info("Loading Demo User")

        example_user = {"id": "indi","name": "Indiana Jones"}
        users = p | "Loading user" >> beam.Create([example_user])

    process1 = users | "load all items for user" >> beam.ParDo(UserProcessor().with_input_types(dict))
    process2 = users | "load existing items for user" >> beam.ParDo(ExistingProcessor().with_input_types(dict)) | beam.Map(lambda x: x.get('id'))

    if run_local:
        process1 | "debug process1" >> beam.Map(print)
        process2 | "debug process2" >> beam.Map(print)

    #filtered = (process1, process2) | beam.Flatten() # this works
    filtered = process1 | "filter all against existing items" >> beam.Filter(filter_existing, existing=beam.pvalue.AsList(process2)) # this does NOT work when streaming, it does in batch

    if run_local:
        filtered | "debug filtered" >> beam.Map(print)
        filtered | "write down result" >> beam.io.WriteToText(os.path.join(os.path.dirname(__file__), "test_result.txt"))
    else:
        filtered | "write down result" >> beam.io.WriteToText("gs://{}/playground/output.txt".format(bucket))

    if run_local:
        print("pipeline initialized!")
    else:
        logging.info("pipeline initialized!")

Running this script as a batch job on Google's Dataflow does exactly what it is supposed to do. Here is the pipeline as visualized by Dataflow:

[Image: Dataflow job graph]

As soon as I set the variables streaming = True and run_local = False, the job no longer works and returns no errors at all. I am a bit lost at this point.

If I disable the side input in this part of the script:

# this works in streaming AND batch mode
# filtered = (process1, process2) | beam.Flatten() # this works

# this does NOT work in streaming mode, but DOES works in batch mode
filtered = process1 | "filter all against existing items" >> beam.Filter(filter_existing, existing=beam.pvalue.AsList(process2))

then the streaming job on Dataflow works. I have narrowed the problem down to the part beam.pvalue.AsList(process2).

I should probably mention that I am quite new to Apache Beam. One thing I could figure out is that the problem is the side input not being windowed (whatever that means exactly; I just got it from the docs ;)).

You are correct that the problem is that the side input is not windowed. When you run a parallel do (Map, FlatMap, Filter, ...) with a side input, the runner waits until the side input is "fully computed" before running the main input. In this case it is reading from a PubSub source with no windowing, which means it will never be "done" (i.e. more data could always arrive in the future).

To make this work, you need to window both sides. That way the side input becomes "complete up to time X", and the filter can then run "up to time X" as X jumps forward from window boundary to window boundary.
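
To make that concrete, here is a minimal sketch (not from the original posts; the topic name and the two branch transforms are placeholders) of windowing both sides by windowing their shared source, so that the side input becomes finite per window:

import apache_beam as beam

with beam.Pipeline(argv=['--streaming']) as p:
    # Windowing the common source windows every PCollection derived from it.
    users = (p
             | "read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
             | "window" >> beam.WindowInto(beam.window.FixedWindows(10)))  # 10-second windows

    # Both branches inherit the fixed windowing of `users`.
    items = users | "items" >> beam.FlatMap(lambda u: ({"id": i} for i in range(10)))
    existing = users | "existing" >> beam.FlatMap(lambda u: [3, 5, 9])

    # Per window the side input is now "complete", so the runner can
    # fire the Filter instead of waiting forever on an unbounded source.
    filtered = items | "filter" >> beam.Filter(
        lambda item, ex: item["id"] not in ex,
        ex=beam.pvalue.AsList(existing))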

Thanks to @robertwb's answer I was able to fix my code. I changed the pipeline code around beam.io.ReadFromPubSub and added a windowing PTransform:

users = p | "Loading user" >> beam.io.ReadFromPubSub(topic=topic) | beam.Map(lambda x: json.loads(x.decode())) \
    | "Window into Fixed Intervals" >> beam.WindowInto(beam.window.FixedWindows(10))

Not sure whether this is the best windowing for this use case, but it got me a big step forward.
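
One note on that choice: FixedWindows takes its size in seconds, so with FixedWindows(10) the filter fires once per 10-second window and only compares main-input elements against side-input entries from the same window. A larger window would wait longer before firing but would collect a more complete exclusion list per window.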
