Google Cloud Dataflow - Python Streaming JSON to PubSub - Differences between DirectRunner and DataflowRunner
Trying to do something conceptually simple but banging my head against the wall.
I am trying to create a streaming Dataflow job in Python which consumes JSON messages from a PubSub topic/subscription, performs some basic manipulation on each message (in this case, converting the temperature from C to F) and then publishes the record back out on a different topic:
from __future__ import absolute_import

import argparse
import json
import logging

import apache_beam as beam
import apache_beam.transforms.window as window


def transform_temp(line):
    """Normalize a Pub/Sub string to a JSON object.

    Lines look like this:
    {"temperature": 29.223036004820123}
    """
    record = json.loads(line)
    record['temperature'] = record['temperature'] * 1.8 + 32
    return json.dumps(record)


def run(argv=None):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_topic', required=True,
        help=('Output PubSub topic of the form '
              '"projects/<PROJECT>/topics/<TOPIC>".'))
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        '--input_topic',
        help=('Input PubSub topic of the form '
              '"projects/<PROJECT>/topics/<TOPIC>".'))
    group.add_argument(
        '--input_subscription',
        help=('Input PubSub subscription of the form '
              '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>".'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        # Read the pubsub topic into a PCollection.
        lines = (p
                 | beam.io.ReadStringsFromPubSub(known_args.input_topic)
                 | beam.Map(transform_temp)
                 | beam.io.WriteStringsToPubSub(known_args.output_topic))


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
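For reference, the C-to-F conversion itself behaves as expected when checked outside the pipeline (the function is repeated here so the snippet stands alone):

```python
import json

def transform_temp(line):
    # Same function as in the pipeline above.
    record = json.loads(line)
    record['temperature'] = record['temperature'] * 1.8 + 32
    return json.dumps(record)

# 100 C should come out as 212 F.
print(transform_temp('{"temperature": 100.0}'))
# {"temperature": 212.0}
```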
When running this code locally using the DirectRunner, everything works fine. However, after switching to the DataflowRunner, I never see any messages published on the new topic.
I've also tried adding some logging calls to the transform_temp function, but I don't see anything in the console logs in Dataflow.
Any suggestions? Btw - if I just plumb the input topic straight through to the output topic, I can see the messages, so I know the streaming is working OK.
Many thanks!
You might just be lacking a WindowInto transform. Apache Beam's docs state that for a streaming pipeline you are required to set either a non-default window or a non-default trigger. As you have not defined a window, you have a single global window, and the pipeline may thus wait endlessly for the end of that window before writing anything to the sink.
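In Beam this would mean inserting something like `| beam.WindowInto(window.FixedWindows(10))` between the read and the map in your pipeline (the 10-second window size here is just an illustrative choice, not something from your post). Conceptually, fixed windowing assigns each timestamped element to the non-overlapping interval starting at floor(ts / size) * size. A minimal pure-Python sketch of that grouping, without Beam itself:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows.

    Mirrors what beam.WindowInto(window.FixedWindows(window_size)) does:
    each element lands in the window starting at floor(ts / size) * size.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# Temperature readings with second-granularity timestamps,
# grouped into 10-second windows.
events = [(0, 29.2), (3, 29.5), (11, 30.1), (19, 30.4), (25, 31.0)]
print(assign_fixed_windows(events, 10))
# {0: [29.2, 29.5], 10: [30.1, 30.4], 20: [31.0]}
```

With a window (or a non-default trigger) in place, the sink no longer has to wait on the global window, so the transformed messages should start appearing on the output topic.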