
Python BigQuery Storage Write retry strategy when writing to default stream

I am testing python-bigquery-storage to insert multiple items into a table using the _default stream.

I based this on the example in the official documentation, modified to use the default stream.

Here is a minimal example similar to what I am trying to do:

customer_record.proto

syntax = "proto2";

message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}
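
(The customer_record_pb2 module imported in the script below is generated from this .proto file with the protocol buffer compiler, e.g. protoc --python_out=. customer_record.proto, assuming protoc is installed.)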

append_rows_default.py

from itertools import islice

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2

import customer_record_pb2

import logging
logging.basicConfig(level=logging.DEBUG)

CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.

def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            break

def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default

    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'

    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()

    # The initial request must contain the stream name.
    request_template.write_stream = stream_name

    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data

    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    return append_rows_stream

def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):

    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)

    response_futures = []

    row_count = 0

    # Send the rows in chunks, to limit memory usage.

    for chunk in chunks(rows, CHUNK_SIZE):

        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())

        # Create an append row request containing the rows
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        future = append_rows_stream.send(request)

        response_futures.append(future)

    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()

    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()

    return row_count

def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row

def main():

    write_client = bigquery_storage_v1.BigQueryWriteClient()

    rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]

    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)

if __name__ == '__main__':
    main()

Notes:

  • In the above, CHUNK_SIZE is 2 just for this minimal example; in the real case I use a chunk size of 5000.
  • In actual use, I have several separate streams of data that need to be processed in parallel, so I call send_rows_to_bq several times, once per stream of data, using a thread pool (one thread per stream of data). (I am assuming here that an AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong about that.) A sketch of that parallel setup is shown right after this list.
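
For reference, a minimal sketch of what I mean by that parallel setup (row_lists, one list of rows per stream of data, is an illustrative name, and sharing a single write_client between the threads is also just my assumption):

from concurrent.futures import ThreadPoolExecutor

def send_all_streams(project_id, dataset_id, table_id, write_client, row_lists):
    # One thread per stream of data; each call to send_rows_to_bq creates its
    # own AppendRowsStream, so no stream manager is shared between threads.
    with ThreadPoolExecutor(max_workers=len(row_lists)) as executor:
        futures = [
            executor.submit(
                send_rows_to_bq, project_id, dataset_id, table_id, write_client, rows
            )
            for rows in row_lists
        ]
        return [f.result() for f in futures]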

It mostly works, but I often get intermittent errors when calling the send method of append_rows_stream:

  • google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
  • google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.

I think I just need to retry on these errors, but I am not sure how best to implement a retry strategy here. My impression is that I would need the following strategy to retry errors when calling send:

  • If the error is a StreamClosedError, the append_rows_stream stream manager can no longer be used, so I would need to call close on it, then call my create_stream_manager again to create a new one, and then try calling send on the new stream manager.
  • Otherwise, on any google.api_core.exceptions.ServerError, retry the call to send on the same stream manager (see the sketch after this list).
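
Something like this rough, untested sketch is what I have in mind (the retry count and returning the possibly-replaced stream manager are just choices I made for illustration):

from google.api_core import exceptions as core_exceptions
from google.cloud.bigquery_storage_v1.exceptions import StreamClosedError

MAX_SEND_RETRIES = 3

def send_with_retry(append_rows_stream, request,
                    project_id, dataset_id, table_id, write_client):
    """Try to send a request, replacing the stream manager if it was closed."""
    for _ in range(MAX_SEND_RETRIES):
        try:
            future = append_rows_stream.send(request)
            # Return both, because the stream manager may have been replaced.
            return append_rows_stream, future
        except StreamClosedError:
            # The manager can no longer be used: close it, create a new one,
            # then retry the send on the new manager.
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
        except core_exceptions.ServerError:
            # Transient server-side error: retry on the same stream manager.
            continue
    raise RuntimeError("send still failing after retries")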

Am I approaching this correctly?

Thank you.

The best way to solve this issue is to update to a newer version of the library.

This problem occurred, or still occurs, in older versions because a connection to the Write API hangs once it reaches 10MB.
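
(For example, upgrading the client library with pip install --upgrade google-cloud-bigquery-storage, the PyPI package for python-bigquery-storage, should pick up the newer behaviour.)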

If updating to a newer library does not work, you can try the following options:

  • Limit the connection to < 10MB.
  • Disconnect and reconnect to the API (a rough sketch combining both options follows below).
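
For illustration, a rough sketch of how those two options could be combined, reusing the create_stream_manager and chunks helpers from the question (the 9MB threshold and the byte accounting based only on serialized row sizes are assumptions, not part of the original code):

MAX_CONNECTION_BYTES = 9 * 1024 * 1024  # stay safely below the 10MB limit

def send_rows_with_reconnect(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(
        project_id, dataset_id, table_id, write_client)
    bytes_sent = 0
    futures = []

    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            proto_rows.serialized_rows.append(row.SerializeToString())

        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        # Approximate the request size by the serialized row payload only.
        request_bytes = sum(len(r) for r in proto_rows.serialized_rows)

        if bytes_sent + request_bytes > MAX_CONNECTION_BYTES:
            # Wait for the outstanding appends, then disconnect and reconnect.
            for f in futures:
                f.result()
            futures = []
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
            bytes_sent = 0

        futures.append(append_rows_stream.send(request))
        bytes_sent += request_bytes

    for f in futures:
        f.result()
    append_rows_stream.close()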

