
Python BigQuery Storage Write retry strategy when writing to default stream

I am testing python-bigquery-storage to insert multiple items into a table using the _default stream.

I based this on the example in the official documentation, modified to use the default stream.

Here is a minimal example similar to what I am trying to do:

customer_record.proto

syntax = "proto2";

message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}
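
(The customer_record_pb2 module imported in the script below is generated from this .proto file with the protocol buffer compiler, e.g. protoc --python_out=. customer_record.proto, assuming protoc is installed.)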

append_rows_default.py

from itertools import islice

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2

import customer_record_pb2

import logging
logging.basicConfig(level=logging.DEBUG)

CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.

def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            break

def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default

    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'

    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()

    # The initial request must contain the stream name.
    request_template.write_stream = stream_name

    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data

    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    return append_rows_stream

def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):

    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)

    response_futures = []

    row_count = 0

    # Send the rows in chunks, to limit memory usage.

    for chunk in chunks(rows, CHUNK_SIZE):

        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())

        # Create an append row request containing the rows
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        future = append_rows_stream.send(request)

        response_futures.append(future)

    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()

    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()

    return row_count

def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row

def main():

    write_client = bigquery_storage_v1.BigQueryWriteClient()

    rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]

    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)

if __name__ == '__main__':
    main()

Notes:

  • In the above, CHUNK_SIZE is 2 just for this minimal example; in the real case I use a chunk size of 5000.
  • In actual use, I have several separate streams of data that need to be processed in parallel, so I call send_rows_to_bq several times, once per stream of data, using a thread pool (one thread per stream of data). (I am assuming here that an AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong about that.) A sketch of that parallel setup is shown right after this list.
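
For reference, a minimal sketch of what I mean by that parallel setup (row_lists, one list of rows per stream of data, is an illustrative name, and sharing a single write_client between the threads is also just my assumption):

from concurrent.futures import ThreadPoolExecutor

def send_all_streams(project_id, dataset_id, table_id, write_client, row_lists):
    # One thread per stream of data; each call to send_rows_to_bq creates its
    # own AppendRowsStream, so no stream manager is shared between threads.
    with ThreadPoolExecutor(max_workers=len(row_lists)) as executor:
        futures = [
            executor.submit(
                send_rows_to_bq, project_id, dataset_id, table_id, write_client, rows
            )
            for rows in row_lists
        ]
        return [f.result() for f in futures]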

It mostly works, but I often get intermittent errors when calling the send method of append_rows_stream:

  • google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
  • google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.

I think I just need to retry on these errors, but I am not sure how best to implement a retry strategy here. My impression is that I would need the following strategy to retry errors when calling send:

  • If the error is a StreamClosedError, the append_rows_stream stream manager can no longer be used, so I would need to call close on it, then call my create_stream_manager again to create a new one, and then try calling send on the new stream manager.
  • Otherwise, on any google.api_core.exceptions.ServerError, retry the call to send on the same stream manager (see the sketch after this list).
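
Something like this rough, untested sketch is what I have in mind (the retry count and returning the possibly-replaced stream manager are just choices I made for illustration):

from google.api_core import exceptions as core_exceptions
from google.cloud.bigquery_storage_v1.exceptions import StreamClosedError

MAX_SEND_RETRIES = 3

def send_with_retry(append_rows_stream, request,
                    project_id, dataset_id, table_id, write_client):
    """Try to send a request, replacing the stream manager if it was closed."""
    for _ in range(MAX_SEND_RETRIES):
        try:
            future = append_rows_stream.send(request)
            # Return both, because the stream manager may have been replaced.
            return append_rows_stream, future
        except StreamClosedError:
            # The manager can no longer be used: close it, create a new one,
            # then retry the send on the new manager.
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
        except core_exceptions.ServerError:
            # Transient server-side error: retry on the same stream manager.
            continue
    raise RuntimeError("send still failing after retries")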

Am I approaching this correctly?

Thank you.

The best way to solve this issue is to update to a newer version of the library.

This problem occurred, or still occurs, in older versions because a connection to the Write API hangs once it reaches 10MB.
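
(For example, upgrading the client library with pip install --upgrade google-cloud-bigquery-storage, the PyPI package for python-bigquery-storage, should pick up the newer behaviour.)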

If updating to a newer library does not work, you can try the following options:

  • Limit the connection to < 10MB.
  • Disconnect and reconnect to the API (a rough sketch combining both options follows below).
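
For illustration, a rough sketch of how those two options could be combined, reusing the create_stream_manager and chunks helpers from the question (the 9MB threshold and the byte accounting based only on serialized row sizes are assumptions, not part of the original code):

MAX_CONNECTION_BYTES = 9 * 1024 * 1024  # stay safely below the 10MB limit

def send_rows_with_reconnect(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(
        project_id, dataset_id, table_id, write_client)
    bytes_sent = 0
    futures = []

    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            proto_rows.serialized_rows.append(row.SerializeToString())

        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        # Approximate the request size by the serialized row payload only.
        request_bytes = sum(len(r) for r in proto_rows.serialized_rows)

        if bytes_sent + request_bytes > MAX_CONNECTION_BYTES:
            # Wait for the outstanding appends, then disconnect and reconnect.
            for f in futures:
                f.result()
            futures = []
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
            bytes_sent = 0

        futures.append(append_rows_stream.send(request))
        bytes_sent += request_bytes

    for f in futures:
        f.result()
    append_rows_stream.close()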

