How do I use a nested proto.Message with the BigQuery Storage API Writer Python client?

Question

The snippet from https://github.com/googleapis/python-bigquery-storage/issues/398, which uses the proto-plus package to define protobuf messages in Python, is very helpful and works well as is, but it does not work when the message is nested.
The adapted code below raises the following error when await bq_write_client.append_rows(iter([append_row_request])) is called with a nested message: google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.Team" is not defined.

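P.S. I know the google-cloud-bigquery-storage library works with nested messages in general, because the official snippet https://github.com/googleapis/python-bigquery-storage/blob/main/samples/snippets/append_rows_proto2.py works. It uses a nested message, but one defined in a separate .proto file that requires a compilation step, which is less practical than defining the message directly in Python.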

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import asyncio

import proto
from google.oauth2.service_account import Credentials
from google.protobuf.descriptor_pb2 import DescriptorProto
from google.cloud.bigquery_storage_v1beta2.types.storage import AppendRowsRequest
from google.cloud.bigquery_storage_v1beta2.types.protobuf import ProtoSchema, ProtoRows
from google.cloud.bigquery_storage_v1beta2.services.big_query_write import BigQueryWriteAsyncClient

class Team(proto.Message):
    name = proto.Field(proto.STRING, number=1)

class UserSchema(proto.Message):
    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    # Field name matches the "nested" key used in the JSON payload below.
    nested = proto.Field(Team, number=3)

async def main():
    write_stream_path = BigQueryWriteAsyncClient.write_stream_path(
        "yolocommon", "test", "t_test_data", "_default")

    credentials = Credentials.from_service_account_file(filename="bigquery_config_file.json")
    bq_write_client = BigQueryWriteAsyncClient(credentials=credentials)

    proto_descriptor = DescriptorProto()
    UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema = ProtoSchema(proto_descriptor=proto_descriptor)

    serialized_rows = []
    data = [
        {
            "username": "Jack",
            "email": "jack@google.com",
            "nested": {
                "name": "Jack Jack"
            }
        },
        {
            "username": "mary",
            "email": "mary@google.com",
            "nested": {
                "name": "Mary Mary"
            }
        }
    ]
    for item in data:
        instance = UserSchema.from_json(payload=json.dumps(item))
        serialized_rows.append(UserSchema.serialize(instance))

    proto_data = AppendRowsRequest.ProtoData(
        rows=ProtoRows(serialized_rows=serialized_rows),
        writer_schema=proto_schema
    )

    append_row_request = AppendRowsRequest(
        write_stream=write_stream_path,
        proto_rows=proto_data
    )

    result = await bq_write_client.append_rows(iter([append_row_request]))
    async for item in result:
        print(item)


if __name__ == "__main__":
    asyncio.run(main())

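Update: from the ProtoSchema documentation:

Descriptor for input message. The provided descriptor must be self contained, such that data rows sent can be fully decoded using only the single descriptor. For data rows that are compositions of multiple independent messages, this means the descriptor may need to be transformed to only use nested types: https://developers.google.com/protocol-buffers/docs/proto#nested

So the correct way to write the message description is: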

class UserSchema(proto.Message):
    class Team(proto.Message):
        name = proto.Field(proto.STRING, number=1)

    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    nested = proto.Field(Team, number=3)

But it still throws the same error: google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.UserSchema.Team" is not defined.

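UPDATE 2: The root of the problem is that proto-plus uses _default_package as the package name when the package name is empty, because an empty package name causes a different error. See https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40, where the comment reads:

TODO: Revert to empty string as a package value after protobuf fix. When package is empty, upb based protobuf fails with an "TypeError: Couldn't build proto file into descriptor pool: invalid name: empty part ()' means" during an attempt to add to descriptor pool.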

Apparently, it is currently not possible to use proto.Message to represent a BigQuery table that has a nested field (STRUCT).
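The mismatch described in UPDATE 2 can be illustrated without calling BigQuery. The sketch below is only a model: it assumes, based on the error message, that the service resolves type names inside the single descriptor with no package, and it uses SimpleNamespace objects as stand-ins that mirror the name, field, nested_type and type_name attributes of DescriptorProto. The helper names defined_names and unresolved_references are hypothetical, not part of any library.

```python
from types import SimpleNamespace


def defined_names(descriptor, prefix=""):
    """Return the fully qualified names defined by a descriptor and its nested types."""
    full_name = f"{prefix}.{descriptor.name}"
    names = {full_name}
    for nested in descriptor.nested_type:
        names |= defined_names(nested, full_name)
    return names


def unresolved_references(descriptor):
    """List field type references that the descriptor itself does not define.

    Assumption: names are resolved without any package. Enum definitions
    (enum_type) are ignored in this sketch.
    """
    known = defined_names(descriptor)
    missing = []

    def walk(message):
        for field in message.field:
            # Scalar fields carry an empty type_name, so they are skipped.
            if field.type_name and field.type_name not in known:
                missing.append(field.type_name)
        for nested in message.nested_type:
            walk(nested)

    walk(descriptor)
    return missing


# Stand-in for the descriptor of the nested UserSchema message.
team = SimpleNamespace(name="Team", field=[SimpleNamespace(type_name="")], nested_type=[])
user_schema = SimpleNamespace(
    name="UserSchema",
    field=[
        SimpleNamespace(type_name=""),  # username
        SimpleNamespace(type_name=""),  # email
        SimpleNamespace(type_name="._default_package.UserSchema.Team"),
    ],
    nested_type=[team],
)

print(unresolved_references(user_schema))
# → ['._default_package.UserSchema.Team']
```

Under that assumption the descriptor defines .UserSchema.Team, while the field refers to ._default_package.UserSchema.Team, which matches the shape of the "is not defined" error above.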

Answer 1

protobuf has since been fixed, so fork the project and change the line at https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40

to

    package = getattr(
        proto_module, "package", module_name if module_name else ""
    )

and it will work.
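If forking proto-plus is not an option, a different approach (not the one this answer describes, and not verified against BigQuery) might be to rewrite the type references in the descriptor after CopyToProto and before building the ProtoSchema. The sketch below assumes the only problem is the ._default_package. prefix; strip_package_prefix is a hypothetical helper. It relies only on the field, nested_type and type_name attributes, so the demonstration uses SimpleNamespace stand-ins rather than a real DescriptorProto.

```python
from types import SimpleNamespace

# Prefix that proto-plus adds when no package is declared (see the question).
DEFAULT_PACKAGE_PREFIX = "._default_package."


def strip_package_prefix(descriptor, prefix=DEFAULT_PACKAGE_PREFIX):
    """Rewrite field type references in place so they no longer carry the package."""
    for field in descriptor.field:
        if field.type_name.startswith(prefix):
            # "._default_package.UserSchema.Team" becomes ".UserSchema.Team"
            field.type_name = "." + field.type_name[len(prefix):]
    for nested in descriptor.nested_type:
        strip_package_prefix(nested, prefix)
    return descriptor


# Stand-in for the descriptor of the nested UserSchema message.
team = SimpleNamespace(name="Team", field=[SimpleNamespace(type_name="")], nested_type=[])
user_schema = SimpleNamespace(
    name="UserSchema",
    field=[
        SimpleNamespace(type_name=""),  # username
        SimpleNamespace(type_name=""),  # email
        SimpleNamespace(type_name="._default_package.UserSchema.Team"),
    ],
    nested_type=[team],
)

strip_package_prefix(user_schema)
print(user_schema.field[2].type_name)
# → .UserSchema.Team
```

With the code from the question, the call would go between UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor) and ProtoSchema(proto_descriptor=proto_descriptor). Whether the service then accepts the schema is an open assumption.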

Answer 2

The following module helps you bypass the precompilation step or the message class definitions in proto-plus: https://pypi.org/project/xia-easy-proto/1.0.0/

You can parse a Python object and convert it to protobuf. I hope it helps.

Answer 1: score 1, accepted, 2022-09-29 12:21:25

Answer 2: score -1, 2022-10-01 17:45:23
