Using WriteToBigQuery FILE_LOADS in a streaming pipeline just creates a lot of temporary tables (Python SDK)

Question

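I have a streaming pipeline that takes messages from Pub/Sub, parses them, and writes them to BigQuery. The challenge is that each message goes to a different event table depending on the event property in the message, and the messages are not ordered.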

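This means (I believe) that the WriteToBigQuery method cannot batch the writes efficiently. I see it writing essentially one message at a time, so it runs far too slowly. I also tried adding a 60-second window followed by a GroupByKey / FlatMap to reorder the messages, but that did very little to speed things up.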

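Using the FILE_LOADS method in WriteToBigQuery with a triggering frequency of 60 seconds or more seems to work: it sends load jobs, which (at least sometimes) succeed, and I see the data go into the correct tables. However, the temporary tables it creates are never deleted, so I end up with hundreds of tables (with names like beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_756_37417blahblahblah)... which is obviously not sustainable.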

Writing via STREAMING_INSERTS works fine, it is just slow; FILE_LOADS is my attempt to make the pipeline more efficient.

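If anyone can help me figure out why the tables are not being deleted, I think that would give me a working, efficient pipeline. I have tried longer triggering frequencies (up to 1 hour), but the same behavior occurs.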

Here is my main pipeline. Again, I have no problems with the rest of it; this is just for context.


    events, non_events = (p 
        | 'ReadData' >> beam.io.ReadFromPubSub(subscription = known_args.input_subscription).with_output_types(bytes)
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'Parse JSON to Dict' >> beam.Map(lambda line: json.loads(line))
        | 'FilterOutNonEvents' >> beam.ParDo(FilterOutNonEvents()).with_outputs('MAIN_OUT', 'non_events')
    )
    
    parsed, missing_tables, _ = (events
        | 'ParseDict' >> beam.ParDo(ParseDict()).with_outputs('MAIN_OUT', 'missing_tables', 'ignore')
    )
    
    results, conversion_errors = (parsed
        | 'ConvertDataTypes' >> beam.ParDo(ConvertDataTypes()).with_outputs('MAIN_OUT', 'error_data')
    )
    
    final = (results
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                table = lambda record: '{project}:{dataset}.{table}'.format(project = known_args.project, dataset = known_args.dataset, table = parse_event_to_dataset_name(patterns, record["event"])),
                schema = lambda tbl: {'fields':[{'name':c.split(':')[0], 'type':c.split(':')[1]} for c in schema_json[tbl.split('.')[-1]].split(',')]},
                create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND,
                method = 'FILE_LOADS',
                triggering_frequency = 60
        )
    )

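The table arg is determined by the event property of the message, and the schema arg is just a reformatted slice of a global variable (initially read from GCS; again, this causes no problems with STREAMING_INSERTS).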
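For readers following the schema lambda above, here is a minimal sketch of the same reformatting written as a named function. It assumes, based only on how the lambda splits its input, that `schema_json` maps a table name to a string of comma-separated `name:TYPE` pairs; the function name `schema_for_table` and the sample values in the comments are hypothetical.

```python
def schema_for_table(table_spec, schema_json):
    """Build a BigQuery schema dict for one destination table.

    table_spec:  destination such as 'project:dataset.table'
    schema_json: dict of table name -> 'col:TYPE,col:TYPE' string
                 (assumed layout, inferred from the lambda in the question)
    """
    # The table name is whatever follows the last dot in the destination.
    table_name = table_spec.split('.')[-1]

    fields = []
    for column in schema_json[table_name].split(','):
        # Split once so each entry yields exactly a name and a type.
        name, field_type = column.strip().split(':', 1)
        fields.append({'name': name, 'type': field_type})

    return {'fields': fields}
```

Passed as `schema=lambda tbl: schema_for_table(tbl, schema_json)`, this should behave like the inline version while being easier to test on its own.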

Thanks to anyone who can help. I have been banging my head against this (I'm very new to Beam/Dataflow).

Answer 1

When using FILE_LOADS with multiple partitions and/or dynamic destinations, the behavior should be as follows:

'''
2. Multiple partitions and/or Dynamic Destinations:

    When there are multiple partitions of files destined for a single
    destination or when Dynamic Destinations are used, multiple load jobs
    need to be triggered for each partition/destination. Load Jobs are
    triggered to temporary tables, and those are later copied to the actual
    appropriate destination table. This ensures atomicity when only some
    of the load jobs would fail but not other. If any of them fails, then
    copy jobs are not triggered.
'''

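The code also indicates that after the load jobs are triggered, Beam should wait for them to finish, then copy the data out of the temporary tables and delete them. However, when used in a streaming pipeline, it does not seem to complete these steps. In my reproduction with the DirectRunner, it never even reached the CopyJob. I suggest reporting this to the Apache Beam team.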
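Until the cleanup step runs as intended, the leftover tables can be removed by hand. The following is only a sketch of one possible stopgap using the `google-cloud-bigquery` client, not something Beam does itself. The prefix is an assumption taken from the table names quoted in the question, and the function names are hypothetical. Note that it could also match temporary tables belonging to load jobs that are still in flight, so it defaults to a dry run.

```python
# Assumption: leftover tables share the prefix seen in the question.
TEMP_TABLE_PREFIX = "beam_bq_job_LOAD"


def is_leftover_temp_table(table_id, prefix=TEMP_TABLE_PREFIX):
    """Return True if the table name looks like a Beam load-job temp table."""
    return table_id.startswith(prefix)


def delete_leftover_temp_tables(dataset, dry_run=True):
    """List (and optionally delete) matching tables in 'project.dataset'.

    Returns the table ids that matched. With dry_run=True nothing is deleted.
    """
    # Imported here so the name filter above can be used without the client.
    from google.cloud import bigquery

    client = bigquery.Client()
    matched = []
    for item in client.list_tables(dataset):
        if is_leftover_temp_table(item.table_id):
            if not dry_run:
                client.delete_table(item.reference, not_found_ok=True)
            matched.append(item.table_id)
    return matched
```

Running it once with `dry_run=True` and inspecting the returned names before deleting anything is the safer order of operations.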

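Nonetheless, for your use case I would reconsider the load job approach, since you may quickly hit the quotas for load and copy jobs. Streaming inserts are probably a better fit for this scenario, and may give better insert performance than load jobs triggered every 60 seconds or more.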
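To make the quota concern concrete, here is a rough back-of-the-envelope sketch. It assumes the simplest case: every destination table receives data in every triggering interval, and each trigger produces exactly one load job and one copy job per destination. Real pipelines may produce more (multiple partitions) or fewer (idle tables). The function name and the sample figure of 50 tables are illustrative, not taken from the question.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_jobs_per_day(triggering_frequency_s, destination_tables):
    """Estimate daily BigQuery job counts for FILE_LOADS with dynamic destinations.

    Assumes one load job and one copy job per destination per trigger.
    """
    triggers_per_day = SECONDS_PER_DAY // triggering_frequency_s
    return {
        "load_jobs_per_table": triggers_per_day,
        "load_jobs_total": triggers_per_day * destination_tables,
        "copy_jobs_total": triggers_per_day * destination_tables,
    }


# Example: a 60-second trigger writing to 50 event tables.
estimate = estimate_jobs_per_day(60, 50)
print(estimate)
```

These counts are meant to be compared against the per-table and per-project limits in the current BigQuery quota documentation, which are deliberately not hard-coded here.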

Answer 1: score 2, accepted, 2020-10-26 19:16:51
