I'm testing python-bigquery-storage to insert multiple items into a table using the _default stream.
I used the example shown in the official docs as a basis, and modified it to use the default stream.
Here is a minimal example that's similar to what I'm trying to do:
customer_record.proto
syntax = "proto2";
message CustomerRecord {
optional string customer_name = 1;
optional int64 row_num = 2;
}
append_rows_default.py
from itertools import islice
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2
import customer_record_pb2
import logging
logging.basicConfig(level=logging.DEBUG)
CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.
def chunks(l, n):
"""Yield successive `n`-sized chunks from `l`."""
_it = iter(l)
while True:
chunk = [*islice(_it, 0, n)]
if chunk:
yield chunk
else:
break
def create_stream_manager(project_id, dataset_id, table_id, write_client):
# Use the default stream
# The stream name is:
# projects/{project}/datasets/{dataset}/tables/{table}/_default
parent = write_client.table_path(project_id, dataset_id, table_id)
stream_name = f'{parent}/_default'
# Create a template with fields needed for the first request.
request_template = types.AppendRowsRequest()
# The initial request must contain the stream name.
request_template.write_stream = stream_name
# So that BigQuery knows how to parse the serialized_rows, generate a
# protocol buffer representation of our message descriptor.
proto_schema = types.ProtoSchema()
proto_descriptor = descriptor_pb2.DescriptorProto()
customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
proto_schema.proto_descriptor = proto_descriptor
proto_data = types.AppendRowsRequest.ProtoData()
proto_data.writer_schema = proto_schema
request_template.proto_rows = proto_data
# Create an AppendRowsStream using the request template created above.
append_rows_stream = writer.AppendRowsStream(write_client, request_template)
return append_rows_stream
def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):
append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)
response_futures = []
row_count = 0
# Send the rows in chunks, to limit memory usage.
for chunk in chunks(rows, CHUNK_SIZE):
proto_rows = types.ProtoRows()
for row in chunk:
row_count += 1
proto_rows.serialized_rows.append(row.SerializeToString())
# Create an append row request containing the rows
request = types.AppendRowsRequest()
proto_data = types.AppendRowsRequest.ProtoData()
proto_data.rows = proto_rows
request.proto_rows = proto_data
future = append_rows_stream.send(request)
response_futures.append(future)
# Wait for all the append row requests to finish.
for f in response_futures:
f.result()
# Shutdown background threads and close the streaming connection.
append_rows_stream.close()
return row_count
def create_row(row_num: int, name: str):
row = customer_record_pb2.CustomerRecord()
row.row_num = row_num
row.customer_name = name
return row
def main():
write_client = bigquery_storage_v1.BigQueryWriteClient()
rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]
send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)
if __name__ == '__main__':
main()
Note:
CHUNK_SIZE
is 2 just for this minimal example, but, in a real situation, I used a chunk size of 5000.send_rows_to_bq
, one for each stream of data, using a thread pool (one thread per stream of data). (I'm assuming here that AppendRowsStream
is not meant to be shared by multiple threads, but I might be wrong). It mostly works, but I often get a mix of intermittent errors in the call to append_rows_stream
's send
method:
google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.
I think I just need to retry on these errors, but I'm not sure how to best implement a retry strategy here. My impression is that I need to use the following strategy to retry errors when calling send
:
StreamClosedError
, the append_rows_stream
stream manager can't be used anymore, and so I need to call close
on it and then call my create_stream_manager
again to create a new one, then try to call send
on the new stream manager.google.api_core.exceptions.ServerError
error, retry the call to send
on the same stream manager.Am I approaching this correctly?
Thank you.
The best solution to this problem is to update to the newer lib release .
This problem happens or was happening in the older versions because once the connection write API reaches 10MB, it hangs.
If the update to the newer lib does not work you can try these options:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.