Writing Apache Beam Tagged Outputs (Dataflow runner) to different BQ tables

Question

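I seem to be having trouble writing tagged PCollections to multiple destination tables in BQ. The pipeline runs without errors, but no data is written.

If I run the pipeline without TaggedOutput, the PCollection elements are produced correctly and written to a BQ table as expected (albeit a single table rather than several). So I suspect I am misunderstanding how TaggedOutput actually works.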

Code

I have a process fn that produces tagged outputs:

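import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class ProcessFn(beam.DoFn):
    def process(self, el):
        # Route each element to one of two tagged outputs.
        if el > 5:
            yield TaggedOutput('more_than_5', el)
        else:
            yield TaggedOutput('less_than_5', el)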

And the pipeline:

with beam.Pipeline(options=beam_options) as p:

    # Read the table rows into a PCollection.
    results = (
        p
        | "read" >> beam.io.ReadFromBigQuery(table=args.input_table, use_standard_sql=True)
        | "process rows" >> beam.ParDo(ProcessFn()).with_outputs(
                                        'more_than_5',
                                        main='less_than_5')
    )

    results.less_than_5 | "write to bq 1" >> beam.io.WriteToBigQuery(
            'dataset.less_than_5',
            schema=less_than_5_schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )


    results.more_than_5 | "write to bq 2" >> beam.io.WriteToBigQuery(
            'dataset.more_than_5',
            schema=more_than_5_schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )

Answer 1

I think the problem is in how the results are retrieved when the code has multiple sinks.

The results should be retrieved as a tuple:

results_less_than_5, result_more_than_5 = (
    p
    | "read" >> beam.io.ReadFromBigQuery(table=args.input_table, use_standard_sql=True)
    | "process rows" >> beam.ParDo(ProcessFn()).with_outputs(
        'more_than_5',
        main='less_than_5')
)

results_less_than_5 | "write to bq 1" >> beam.io.WriteToBigQuery(
    'dataset.less_than_5',
    schema=less_than_5_schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)

result_more_than_5 | "write to bq 2" >> beam.io.WriteToBigQuery(
    'dataset.more_than_5',
    schema=more_than_5_schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)

Can you try this syntax?

Answer 2

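The main=... keyword of with_outputs is meant for elements that are yielded without a TaggedOutput wrapper. In this case you should probably write with_outputs('more_than_5', 'less_than_5'). After that, accessing the results by name or unpacking them as a tuple should both work.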

Answer 1
2 2022-08-24 14:24:05

Answer 2
0 2022-08-25 23:51:14
