Dataflow job does not produce any output
I have an issue where the Dataflow job actually runs fine, but it does not produce any output until the job is manually drained.
With the following code I was assuming that it would produce windowed output, effectively triggering after each window.
lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    | beam.WindowInto(
        beam.window.FixedWindows(5 * 60),
        trigger=beam.trigger.Repeatedly(
            beam.trigger.AfterProcessingTime(5 * 60)),
        accumulation_mode=beam.trigger.AccumulationMode.DISCARDING)
    | "write" >> sink
)
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is Cloud Pub/Sub, with approximately 100 events/minute.
These are the parameters I use to start the job:
python main.py \
--region $(REGION) \
--service_account_email $(SERVICE_ACCOUNT_EMAIL_TEST) \
--staging_location gs://$(BUCKET_NAME_TEST)/beam_stage/ \
--project $(TEST_PROJECT_ID) \
--inputTopic $(TOPIC_TEST) \
--outputLocation gs://$(BUCKET_NAME_TEST)/beam_output/ \
--streaming \
--runner DataflowRunner \
--temp_location gs://$(BUCKET_NAME_TEST)/beam_temp/ \
--experiments=allow_non_updatable_job \
--disk_size_gb=200 \
--machine_type=n1-standard-2 \
--job_name $(DATAFLOW_JOB_NAME)
Any ideas on how to fix this? I'm using the apache-beam 2.22 SDK with Python 3.7.
Excuse me if you do mean 2.22, because "apache-beam 1.22" seems quite old. Especially since you are using Python 3.7, you might want to try a newer SDK version such as 2.22.0.
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is a Cloud PubSub, with approximately 100 events/minute.
If you just need one pane fired per window, with fixed windows every 5 minutes, you can simply go with
beam.WindowInto(beam.window.FixedWindows(5*60))
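For intuition about what `FixedWindows(5*60)` does: each event timestamp is bucketed by integer division by the window size. A minimal pure-Python sketch of that bucketing (an illustration only, not Beam's actual implementation):

```python
def fixed_window_bounds(timestamp_s, size_s=5 * 60):
    """Return the [start, end) bounds of the fixed window holding timestamp_s."""
    start = int(timestamp_s // size_s) * size_s
    return start, start + size_s

# An event at t=730s with 300s windows falls in the window [600, 900).
```

With the default (watermark) trigger, each such window emits exactly one pane once the watermark passes its end, and windows that received no events emit nothing.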
If you want to customize triggers, you can take a look at the streaming-102 document.
Here is a streaming example with visualization of windowed outputs.
from datetime import timedelta

import apache_beam as beam
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Capture 30 seconds of streaming data for interactive inspection.
ib.options.capture_duration = timedelta(seconds=30)
ib.evict_captured_data()

pstreaming = beam.Pipeline(InteractiveRunner(), options=options)

words = (pstreaming
         | 'Read' >> beam.io.ReadFromPubSub(topic=topic)
         | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5)))

ib.show(words, visualize_data=True, include_window_info=True)
If you run this code in a notebook environment such as JupyterLab, you get to debug streaming pipelines with visualized windowed outputs. Note that the windows are visualized: for a capture period of 30 seconds we get 6 windows, since the fixed window size is set to 5 seconds. You can bin data by window to see which data came in which window.
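The "bin data by window" step amounts to grouping each element by the window it was assigned to. A hypothetical pure-Python sketch of that grouping, assuming `(timestamp, value)` pairs like the window info that `include_window_info=True` surfaces:

```python
from collections import defaultdict


def bin_by_window(timestamped_values, size_s=5):
    """Group (timestamp_s, value) pairs by their fixed-window start."""
    bins = defaultdict(list)
    for ts, value in timestamped_values:
        # Same bucketing rule as FixedWindows: floor to a multiple of size_s.
        bins[int(ts // size_s) * size_s].append(value)
    return dict(bins)
```

For example, with 5-second windows, events at t=0s and t=3s land in the `0` bin while an event at t=7s lands in the `5` bin.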
You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.