Beam / Dataflow custom Python job - Cloud Storage to PubSub

Question

I need to perform a very simple transformation on some data (extract a string from JSON) and then write it to PubSub. I am trying to do this with a custom Python Dataflow job.

I have written a job that successfully writes back to Cloud Storage, but even the simplest attempt at writing to PubSub (with no transformation) results in an error: JOB_MESSAGE_ERROR: Workflow failed. Causes: Expected custom source to have non-zero number of splits.

Has anyone successfully written to PubSub from GCS via Dataflow?

Can anyone shed some light on what is going wrong here?


import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


def run(argv=None):
  # Parse this job's own flags; the remaining arguments are passed on to Beam.
  parser = argparse.ArgumentParser()
  parser.add_argument('--input',
                      dest='input',
                      help='Input file to process.')
  parser.add_argument('--output',
                      dest='output',
                      help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True
  with beam.Pipeline(options=pipeline_options) as p:

    lines = p | ReadFromText(known_args.input)

    output = lines  # Obviously not necessary, but this is where my simple extract goes

    output | beam.io.WriteToPubSub(known_args.output)  # This fails with the error above

Answer 1

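This scenario is not currently possible: when you use streaming mode in Dataflow, the only source you can use is PubSub. You cannot switch to batch mode either, because the Apache Beam PubSub source and sink are only available for streaming (for remote execution such as the Dataflow runner).

That is why you can run the pipeline when you drop WriteToPubSub and the streaming flag.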
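If the data still has to reach PubSub from a batch job, one workaround that is sometimes suggested is to publish from inside the pipeline with the Pub/Sub client library instead of WriteToPubSub. The sketch below is not from the original answer and only illustrates the per-element logic under some assumptions: the helper names `extract_field_as_bytes` and `publish_all`, the JSON field `name`, and the sample lines are all hypothetical. The `publish` argument is a plain callable so the example runs without GCP access; in a real job it might wrap `PublisherClient.publish(topic_path, data)` from the google-cloud-pubsub package.

```python
import json


def extract_field_as_bytes(line, field):
    """Parse one JSON line and return the chosen field as UTF-8 bytes.

    Pub/Sub message payloads are bytes, so the string is encoded here.
    """
    record = json.loads(line)
    return str(record[field]).encode("utf-8")


def publish_all(payloads, publish):
    """Send every payload through `publish` and return how many were sent.

    `publish` is a hypothetical stand-in for a real publisher call.
    """
    count = 0
    for payload in payloads:
        publish(payload)
        count += 1
    return count


# Stand-in publisher: collect payloads in a list instead of calling Pub/Sub.
sent = []
lines = ['{"id": "a1", "name": "first"}', '{"id": "b2", "name": "second"}']
total = publish_all(
    (extract_field_as_bytes(line, "name") for line in lines),
    sent.append,
)
```

Inside a Beam pipeline, the same extraction function could be applied with something like `lines | beam.Map(extract_field_as_bytes, 'name')`, with the publishing done in a DoFn that creates its client once per worker. Whether this suits a given job depends on volume and on how failures should be retried.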

Solution 1
2 accepted 2019-07-31 21:07:03
