
Python BigQuery Storage Write retry strategy when writing to default stream

I'm testing python-bigquery-storage to insert multiple items into a table using the _default stream.

I used the example shown in the official docs as a basis, and modified it to use the default stream.

Here is a minimal example that's similar to what I'm trying to do:

customer_record.proto

syntax = "proto2";

message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}

append_rows_default.py

from itertools import islice

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2

import customer_record_pb2

import logging
logging.basicConfig(level=logging.DEBUG)

CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.

def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            break

def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default

    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'

    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()

    # The initial request must contain the stream name.
    request_template.write_stream = stream_name

    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data

    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    return append_rows_stream

def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):

    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)

    response_futures = []

    row_count = 0

    # Send the rows in chunks, to limit memory usage.

    for chunk in chunks(rows, CHUNK_SIZE):

        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())

        # Create an append row request containing the rows
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        future = append_rows_stream.send(request)

        response_futures.append(future)

    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()

    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()

    return row_count

def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row

def main():

    write_client = bigquery_storage_v1.BigQueryWriteClient()

    rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]

    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)

if __name__ == '__main__':
    main()

Note:

  • In the above, CHUNK_SIZE is 2 just for this minimal example; in a real situation I use a chunk size of 5000.
  • In real usage, I have several separate streams of data that need to be processed in parallel, so I make one call to send_rows_to_bq per stream of data, using a thread pool with one thread per stream of data (a simplified sketch of this call pattern is shown after these notes). I'm assuming here that an AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong.
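
This is roughly how I invoke send_rows_to_bq from the thread pool (simplified; data_streams is a placeholder for my real per-stream data, and I share a single BigQueryWriteClient across the threads while each thread creates its own AppendRowsStream):

from concurrent.futures import ThreadPoolExecutor

def send_all_streams(write_client, data_streams):
    # One thread per stream of data; each call to send_rows_to_bq creates
    # its own AppendRowsStream, so no stream manager is shared between threads.
    with ThreadPoolExecutor(max_workers=len(data_streams)) as pool:
        futures = [
            pool.submit(
                send_rows_to_bq,
                "PROJECT_NAME", "DATASET_NAME", "TABLE_NAME",
                write_client, rows,
            )
            for rows in data_streams
        ]
        return [f.result() for f in futures]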

It mostly works, but I often get a mix of intermittent errors from append_rows_stream's send method:

  • google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
  • google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.

I think I just need to retry on these errors, but I'm not sure how best to implement a retry strategy here. My impression is that I need the following strategy to retry errors when calling send (a rough sketch follows this list):

  • If the error is a StreamClosedError, the append_rows_stream stream manager can't be used anymore, so I need to call close on it, call my create_stream_manager again to create a new one, and then retry the send on the new stream manager.
  • Otherwise, on any google.api_core.exceptions.ServerError, retry the call to send on the same stream manager.
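
In other words, I'm imagining something along these lines (just a rough, untested sketch; send_with_retry and MAX_RETRIES are my own hypothetical names, and create_stream_manager is the helper defined above):

from google.api_core import exceptions as api_exceptions
from google.cloud.bigquery_storage_v1.exceptions import StreamClosedError

MAX_RETRIES = 3  # arbitrary limit for this sketch

def send_with_retry(append_rows_stream, request,
                    project_id, dataset_id, table_id, write_client):
    # Returns (future, stream_manager), where stream_manager may be a new one
    # if the original had to be replaced after a StreamClosedError.
    for attempt in range(MAX_RETRIES):
        try:
            return append_rows_stream.send(request), append_rows_stream
        except StreamClosedError:
            # The stream manager can't be reused: close it and create a new one.
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
        except api_exceptions.ServerError:
            # Transient server-side error: retry on the same stream manager.
            pass
    raise RuntimeError("send failed after retries")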

Am I approaching this correctly?

Thank you.

The best solution to this problem is to update to a newer release of the library.

This problem happens (or was happening) in older versions because once the Write API connection reaches 10 MB, it hangs.
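
You can check which version of the client library is currently installed (a quick check using importlib.metadata; the package name assumes you installed it with pip as google-cloud-bigquery-storage) and then upgrade it with pip if it is out of date:

from importlib.metadata import version

# Compare this against the latest release on PyPI before upgrading.
print(version("google-cloud-bigquery-storage"))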

If updating to the newer library does not work, you can try these options:

  • Limit each connection to less than 10 MB.
  • Disconnect and reconnect to the API (see the sketch below).
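
Here is a rough sketch of combining these two workarounds, reusing chunks, CHUNK_SIZE, types, and create_stream_manager from your code (the byte accounting is approximate, and MAX_CONNECTION_BYTES is just a conservative value picked to stay under the 10 MB limit):

MAX_CONNECTION_BYTES = 9 * 1024 * 1024  # stay safely below ~10 MB per connection

def send_rows_reconnecting(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(
        project_id, dataset_id, table_id, write_client)
    bytes_sent = 0
    futures = []

    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            proto_rows.serialized_rows.append(row.SerializeToString())

        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data

        # Approximate the request size by the size of the serialized rows.
        request_size = sum(len(r) for r in proto_rows.serialized_rows)

        if bytes_sent + request_size > MAX_CONNECTION_BYTES:
            # Wait for outstanding requests, then disconnect and reconnect.
            for f in futures:
                f.result()
            futures = []
            append_rows_stream.close()
            append_rows_stream = create_stream_manager(
                project_id, dataset_id, table_id, write_client)
            bytes_sent = 0

        futures.append(append_rows_stream.send(request))
        bytes_sent += request_size

    for f in futures:
        f.result()
    append_rows_stream.close()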
