How do I run multiple WriteToBigQuery transforms in parallel on Google Cloud Dataflow / Apache Beam?

Question

I want to separate events by type from input data that contains several kinds of events:

{"type": "A", "k1": "v1"}
{"type": "B", "k2": "v2"}
{"type": "C", "k3": "v3"}

I want type: A events to go to table A in BigQuery, type: B events to table B, and type: C events to table C.

Here is my code, implemented with the Apache Beam Python SDK, which writes the data to BigQuery:

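import argparse
import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

A_schema = 'type:string, k1:string'
B_schema = 'type:string, k2:string'
C_schema = 'type:string, k2:string'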

class ParseJsonDoFn(beam.DoFn):
    A_TYPE = 'tag_A'
    B_TYPE = 'tag_B'
    C_TYPE = 'tag_C'
    def process(self, element):
        text_line = element.strip()
        data = json.loads(text_line)

        if data['type'] == 'A':
            yield pvalue.TaggedOutput(self.A_TYPE, data)
        elif data['type'] == 'B':
            yield pvalue.TaggedOutput(self.B_TYPE, data)
        elif data['type'] == 'C':
            yield pvalue.TaggedOutput(self.C_TYPE, data)

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                      dest='input',
                      default='data/path/data',
                      help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
      '--runner=DirectRunner',
      '--project=project-id',
      '--job_name=seperate-bi-events-job',
    ])
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    # Build the pipeline explicitly; it is executed by p.run() at the end of run().
    p = beam.Pipeline(options=pipeline_options)
    lines = p | ReadFromText(known_args.input)

    multiple_lines = (
        lines
        | 'ParseJSON' >> (beam.ParDo(ParseJsonDoFn()).with_outputs(
                                      ParseJsonDoFn.A_TYPE,
                                      ParseJsonDoFn.B_TYPE,
                                      ParseJsonDoFn.C_TYPE)))

    a_line = multiple_lines.tag_A
    b_line = multiple_lines.tag_B
    c_line = multiple_lines.tag_C

    (a_line
        | "output_a" >> beam.io.WriteToBigQuery(
                                          'temp.a',
                                          schema = A_schema,
                                          write_disposition = beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                          create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                                        ))

    (b_line
        | "output_b" >> beam.io.WriteToBigQuery(
                                          'temp.b',
                                          schema = B_schema,
                                          write_disposition = beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                          create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                                        ))

    (c_line
        | "output_c" >> beam.io.WriteToBigQuery(
                                          'temp.c',
                                          schema = (C_schema),
                                          write_disposition = beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                          create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                                        ))

    p.run().wait_until_finish()

Output:

INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.

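However, there are two problems:

There is no data in BigQuery?
From the logs, the code seems to run sequentially three times rather than in parallel?

Is there something wrong with my code, or am I missing something?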

Answer 1

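There is no data in BigQuery?

Your code looks fine as far as writing data to BigQuery goes (although C_schema should use k3 instead of k2). Keep in mind that you are streaming the data, so if you click the table's Preview button before the data in the streaming buffer has been committed, you will not see it. Running a SELECT * query will show the expected results.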
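To check the table contents without relying on the Preview tab, the same query can also be issued from Python. This is only a minimal sketch, assuming the google-cloud-bigquery client library is installed and default credentials are configured. The helper names build_query and fetch_rows are hypothetical, and project-id, temp and a are the placeholders from the question.

```python
def build_query(project, dataset, table, limit=10):
    """Return a standard-SQL query selecting a few rows from one table."""
    return "SELECT * FROM `{}.{}.{}` LIMIT {}".format(
        project, dataset, table, limit)


def fetch_rows(project, dataset, table, limit=10):
    """Run the query and return the rows as plain dicts.

    Assumes google-cloud-bigquery is installed and credentials are set up.
    """
    # Imported lazily so build_query can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    rows = client.query(build_query(project, dataset, table, limit)).result()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    # Prints the SQL only; call fetch_rows(...) to actually query BigQuery.
    print(build_query("project-id", "temp", "a"))
```

Calling fetch_rows("project-id", "temp", "a") would then return the rows written by the output_a step, if any have arrived.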

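From the logs, the code seems to run sequentially three times rather than in parallel?

OK, this is interesting. Tracing the WARNING message back to the Beam source code, we find the following: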

# if write_disposition == BigQueryDisposition.WRITE_TRUNCATE we delete
# the table before this point.
if write_disposition == BigQueryDisposition.WRITE_TRUNCATE:
  # BigQuery can route data to the old table for 2 mins max so wait
  # that much time before creating the table and writing it
  logging.warning('Sleeping for 150 seconds before the write as ' +
                  'BigQuery inserts can be routed to deleted table ' +
                  'for 2 mins after the delete and create.')
  # TODO(BEAM-2673): Remove this sleep by migrating to load api
  time.sleep(150)
  return created_table
else:
  return created_table

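After reading BEAM-2673 and BEAM-2801, it appears that this is due to an issue with the BigQuery sink, which uses the Streaming API when run with the DirectRunner. When the table is recreated, this causes the process to sleep for 150 seconds, and the writes are not executed in parallel.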
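The effect of one blocking sleep per sink can be illustrated with a toy model. This is not how Beam's runners are implemented. It is a plain-Python sketch in which fake_write is a hypothetical stand-in for a sink that sleeps before writing, and DELAY replaces the 150 seconds with a short value so the demo finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05  # stand-in for the 150-second sleep, shortened for the demo


def fake_write(table):
    """Pretend sink: block for DELAY seconds, then report the table name."""
    time.sleep(DELAY)
    return table


def run_sequential(tables):
    """Call the sinks one after another, as a single worker would."""
    start = time.monotonic()
    done = [fake_write(t) for t in tables]
    return done, time.monotonic() - start


def run_parallel(tables):
    """Call the sinks concurrently, one thread per table."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        done = list(pool.map(fake_write, tables))
    return done, time.monotonic() - start


if __name__ == "__main__":
    tables = ["temp.a", "temp.b", "temp.c"]
    _, seq = run_sequential(tables)
    _, par = run_parallel(tables)
    print("sequential: %.2fs, parallel: %.2fs" % (seq, par))
```

With three tables, the sequential version waits for roughly three delays in a row, which matches the three consecutive "Sleeping for 150 seconds" warnings in the question's log.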

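If we instead run the pipeline on Dataflow (using the DataflowRunner, providing staging and temp bucket paths, and loading the input data from GCS), the three import jobs run in parallel. In my run, all three started at 22:19:45 and finished at 22:19:56.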
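A rough sketch of the options this requires is shown here. The helper build_dataflow_args and the bucket name my-bucket are hypothetical. The flags themselves (--runner, --project, --job_name, --staging_location, --temp_location, --region) are standard Beam pipeline options, but the values must be adapted to your own project.

```python
def build_dataflow_args(project, bucket, job_name, region=None):
    """Return command-line style flags for running a Beam pipeline on Dataflow."""
    args = [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--job_name={}".format(job_name),
        # Dataflow stages the pipeline code and writes temp files to GCS.
        "--staging_location=gs://{}/staging".format(bucket),
        "--temp_location=gs://{}/temp".format(bucket),
    ]
    if region:
        args.append("--region={}".format(region))
    return args


if __name__ == "__main__":
    # In the question's run(), this list would replace the DirectRunner flags:
    #   pipeline_args.extend(build_dataflow_args(...))
    for flag in build_dataflow_args("project-id", "my-bucket",
                                    "separate-bi-events-job"):
        print(flag)
```

The --input argument would then also need to point at a gs:// path, since Dataflow workers cannot read files from the local machine.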

Solution 1
2 Accepted 2018-09-07 20:38:26
