Python BigQuery Storage Write retry strategy when writing to default stream
I am testing python-bigquery-storage to insert multiple items into a table using the _default stream.
I based my code on the example from the official documentation, modified to use the default stream.
Here is a minimal example similar to what I am trying to do:
customer_record.proto
syntax = "proto2";
message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}
append_rows_default.py
from itertools import islice
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2
import customer_record_pb2
import logging
logging.basicConfig(level=logging.DEBUG)
CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.
def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            break
def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream.
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default
    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'
    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()
    # The initial request must contain the stream name.
    request_template.write_stream = stream_name
    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data
    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)
    return append_rows_stream
def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)
    response_futures = []
    row_count = 0
    # Send the rows in chunks, to limit memory usage.
    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())
        # Create an append row request containing the rows.
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data
        future = append_rows_stream.send(request)
        response_futures.append(future)
    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()
    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()
    return row_count
def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row
def main():
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    rows = [create_row(i, f"Test{i}") for i in range(0, 20)]
    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)

if __name__ == '__main__':
    main()
Notes:
- CHUNK_SIZE is 2 only for this minimal example; in the real case I use a chunk size of 5000.
- send_rows_to_bq is called once per stream of data, using a thread pool (one thread per stream of data), roughly as in the sketch after the error messages below. (I am assuming here that an AppendRowsStream is not meant to be shared by multiple threads, but I may be wrong about that.)
It mostly works, but I often get intermittent errors when calling the send method of append_rows_stream:
google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.
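For reference, the per-stream threading setup mentioned in the notes looks roughly like the sketch below; the send_all helper and the rows_by_table mapping are placeholders I am using for illustration, not part of the real code:

from concurrent.futures import ThreadPoolExecutor

def send_all(write_client, rows_by_table):
    # rows_by_table is a placeholder: {table_id: [CustomerRecord, ...]}.
    # One thread (and therefore one AppendRowsStream) per destination table.
    with ThreadPoolExecutor(max_workers=len(rows_by_table)) as pool:
        futures = [
            pool.submit(send_rows_to_bq, "PROJECT_NAME", "DATASET_NAME",
                        table_id, write_client, rows)
            for table_id, rows in rows_by_table.items()
        ]
        # Propagate any exception raised inside the worker threads.
        return [f.result() for f in futures]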
I think I just need to retry on these errors, but I am not sure how best to implement a retry strategy here. My impression is that when calling send I need to retry errors along the following lines, roughly as in the sketch below:
- On StreamClosedError, the append_rows_stream stream manager can no longer be used, so I need to call close on it, call my create_stream_manager again to create a new manager, and then call send on the new stream manager.
- On google.api_core.exceptions.ServerError, retry the send call on the same stream manager.
Am I approaching this correctly?
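To make the question concrete, this is a minimal sketch of the retry strategy I have in mind; the MAX_RETRIES budget and the exact set of retried exception types are my own assumptions:

from google.api_core import exceptions as core_exceptions
from google.cloud.bigquery_storage_v1 import exceptions as bqstorage_exceptions

MAX_RETRIES = 3  # Placeholder retry budget.

def send_with_retry(append_rows_stream, request,
                    project_id, dataset_id, table_id, write_client):
    """Send a request, recreating the stream manager if it has been closed.

    Returns (future, append_rows_stream) so the caller can keep using the
    possibly recreated stream manager for subsequent requests.
    """
    for _ in range(MAX_RETRIES):
        try:
            return append_rows_stream.send(request), append_rows_stream
        except bqstorage_exceptions.StreamClosedError:
            # The manager cannot be reused once closed: discard it, create a
            # fresh one, and retry on the new manager.
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
        except (core_exceptions.ServerError, core_exceptions.Unknown):
            # Transient server-side error: retry on the same stream manager.
            pass
    # Final attempt; let any remaining error propagate to the caller.
    return append_rows_stream.send(request), append_rows_stream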
Thank you.
The best way to solve this problem is to update to a newer version of the library.
This problem occurred (or still occurs) in older versions because the connection to the Write API hangs once it reaches 10 MB.
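As a quick way to confirm which client library version is actually installed (the pip upgrade command in the comment is the usual path, shown here only as a suggestion):

# Print the installed version of the BigQuery Storage client library.
# Upgrading is typically done with:
#   pip install --upgrade google-cloud-bigquery-storage
from importlib.metadata import version

print(version("google-cloud-bigquery-storage"))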
If updating to a newer version of the library does not work, you can try the following options: