
Google Cloud Dataflow - Python Streaming JSON to PubSub - Differences between DirectRunner and DataflowRunner

Trying to do something conceptually simple, but banging my head against the wall.

I am trying to create a streaming Dataflow job in Python which consumes JSON messages from a PubSub topic/subscription, performs some basic manipulation on each message (in this case, converting the temperature from C to F), and then publishes the record back out on a different topic:

from __future__ import absolute_import

import logging
import argparse
import apache_beam as beam
import apache_beam.transforms.window as window
import json

# Normalize a Pub/Sub string to a JSON object.
# Lines look like this:
# {"temperature": 29.223036004820123}


def transform_temp(line):
    """Parse a JSON record and convert its temperature from Celsius to Fahrenheit."""
    record = json.loads(line)
    record['temperature'] = record['temperature'] * 1.8 + 32
    return json.dumps(record)

def run(argv=None):
  """Build and run the pipeline."""

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--output_topic', required=True,
      help=('Output PubSub topic of the form '
            '"projects/<PROJECT>/topic/<TOPIC>".'))
  group = parser.add_mutually_exclusive_group(required=True)
  group.add_argument(
      '--input_topic',
      help=('Input PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
  group.add_argument(
      '--input_subscription',
      help=('Input PubSub subscription of the form '
            '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
  known_args, pipeline_args = parser.parse_known_args(argv)

  with beam.Pipeline(argv=pipeline_args) as p:
    # Read from Pub/Sub into a PCollection, using whichever input was given.
    if known_args.input_subscription:
      messages = p | beam.io.ReadStringsFromPubSub(
          subscription=known_args.input_subscription)
    else:
      messages = p | beam.io.ReadStringsFromPubSub(known_args.input_topic)

    lines = ( messages
                | beam.Map(transform_temp)
                | beam.io.WriteStringsToPubSub(known_args.output_topic)
              )

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

When running this code locally using the DirectRunner, everything works fine. However, after switching to the DataflowRunner, I never see any messages published to the new topic.
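For reference, a sketch of the kind of extra pipeline options a DataflowRunner run takes compared with a DirectRunner run; the project, bucket, and job names below are placeholders, not values from this question:

# Extra options typically passed when targeting the Dataflow service.
# All names below are placeholders.
pipeline_args = [
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder project id
    '--temp_location=gs://my-bucket/tmp',  # placeholder staging bucket
    '--job_name=temp-convert',             # placeholder job name
    '--streaming',                         # request a streaming job
]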

I've also tried adding some logging calls to the transform_temp function but don't see anything in the console logs in Dataflow.
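For illustration, a sketch of the kind of logging call meant here; note that on the Dataflow service, records logged from inside a transform go to Stackdriver Logging for the job's workers rather than to a local console:

import json
import logging

def transform_temp(line):
    # On Dataflow workers this record lands in Stackdriver Logging,
    # not in a local console.
    logging.info('transform_temp received: %s', line)
    record = json.loads(line)
    record['temperature'] = record['temperature'] * 1.8 + 32
    return json.dumps(record)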

Any suggestions? Btw, if I just plumb the input topic straight through to the output topic, I can see the messages, so I know the streaming side is working OK.

Many thanks!

You might just lack a WindowInto transform. The Apache Beam docs state that for a streaming pipeline you are required to set either a non-default window or a non-default trigger. As you have not defined a window, you have a single global window, and the pipeline may therefore wait endlessly for the end of that window before ever writing to the sink.
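For example, a minimal sketch of the fix, keeping the question's transforms and adding an arbitrary 10-second fixed window before the sink (the window module is already imported in the question's code):

  with beam.Pipeline(argv=pipeline_args) as p:
    lines = ( p | beam.io.ReadStringsFromPubSub(known_args.input_topic)
                | beam.Map(transform_temp)
                | beam.WindowInto(window.FixedWindows(10))
                | beam.io.WriteStringsToPubSub(known_args.output_topic)
              )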

