
How to use nested proto.Message with BigQuery Storage API Writer python client?

The snippet from https://github.com/googleapis/python-bigquery-storage/issues/398, which uses the proto-plus package to define protobuf messages in Python, is very helpful and works well as is, but it does not work if the message is nested.
The adapted code below throws the error google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.Team" is not defined. when calling await bq_write_client.append_rows(iter([append_row_request])) if the message is nested.

PS I know that the google-cloud-bigquery-storage library works with nested messages in general, because the official snippet https://github.com/googleapis/python-bigquery-storage/blob/main/samples/snippets/append_rows_proto2.py works and uses a nested message, but it does so in a separate .proto file that needs a compilation step and is not as practical as defining the message directly in Python.

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import asyncio

import proto
from google.oauth2.service_account import Credentials
from google.protobuf.descriptor_pb2 import DescriptorProto
from google.cloud.bigquery_storage_v1beta2.types.storage import AppendRowsRequest
from google.cloud.bigquery_storage_v1beta2.types.protobuf import ProtoSchema, ProtoRows
from google.cloud.bigquery_storage_v1beta2.services.big_query_write import BigQueryWriteAsyncClient

class Team(proto.Message):
    name = proto.Field(proto.STRING, number=1)

class UserSchema(proto.Message):
    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    nested = proto.Field(Team, number=3)

async def main():
    write_stream_path = BigQueryWriteAsyncClient.write_stream_path(
        "yolocommon", "test", "t_test_data", "_default")

    credentials = Credentials.from_service_account_file(filename="bigquery_config_file.json")
    bq_write_client = BigQueryWriteAsyncClient(credentials=credentials)

    proto_descriptor = DescriptorProto()
    UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema = ProtoSchema(proto_descriptor=proto_descriptor)

    serialized_rows = []
    data = [
        {
            "username": "Jack",
            "email": "jack@google.com",
            "nested": {
                "name": "Jack Jack"
            }
        },
        {
            "username": "mary",
            "email": "mary@google.com",
            "nested": {
                "name": "Mary Mary"
            }
        }
    ]
    for item in data:
        instance = UserSchema.from_json(payload=json.dumps(item))
        serialized_rows.append(UserSchema.serialize(instance))

    proto_data = AppendRowsRequest.ProtoData(
        rows=ProtoRows(serialized_rows=serialized_rows),
        writer_schema=proto_schema
    )

    append_row_request = AppendRowsRequest(
        write_stream=write_stream_path,
        proto_rows=proto_data
    )

    result = await bq_write_client.append_rows(iter([append_row_request]))
    async for item in result:
        print(item)


if __name__ == "__main__":
    asyncio.run(main())

UPDATE: From ProtoSchema's documentation:

Descriptor for input message. The provided descriptor must be self-contained, such that data rows sent can be fully decoded using only the single descriptor. For data rows that are compositions of multiple independent messages, this means the descriptor may need to be transformed to only use nested types: https://developers.google.com/protocol-buffers/docs/proto#nested

So the right way to write the message's description is:

class UserSchema(proto.Message):
    class Team(proto.Message):
        name = proto.Field(proto.STRING, number=1)

    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    nested = proto.Field(Team, number=3)

But it still throws the same error: google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.UserSchema.Team" is not defined.

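One way to sidestep proto-plus entirely is to build a self-contained DescriptorProto by hand with the plain protobuf API, embedding Team as a nested type. This is only a sketch based on the snippet above; the relative type_name (no package prefix) is an assumption about how the Storage Write API resolves nested types:

```python
from google.protobuf import descriptor_pb2

# Build a self-contained descriptor: Team is embedded as a nested type,
# so no external reference like "._default_package.Team" is emitted.
user_schema = descriptor_pb2.DescriptorProto(name="UserSchema")

# Nested message type: UserSchema.Team
team = user_schema.nested_type.add()
team.name = "Team"
team.field.add(
    name="name", number=1,
    type=descriptor_pb2.FieldDescriptorProto.TYPE_STRING,
)

# Top-level scalar fields
user_schema.field.add(
    name="username", number=1,
    type=descriptor_pb2.FieldDescriptorProto.TYPE_STRING,
)
user_schema.field.add(
    name="email", number=2,
    type=descriptor_pb2.FieldDescriptorProto.TYPE_STRING,
)

# Message-typed field referring to the nested type by a relative name,
# without any package prefix.
user_schema.field.add(
    name="nested", number=3,
    type=descriptor_pb2.FieldDescriptorProto.TYPE_MESSAGE,
    type_name="UserSchema.Team",
)
```

The resulting `user_schema` could then be passed as `ProtoSchema(proto_descriptor=user_schema)` in place of the proto-plus-generated descriptor.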
UPDATE 2: The root of the issue is that proto-plus appends _default_package as the package name when the package name is empty, because an empty package causes another error: https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40

TODO: Revert to empty string as a package value after protobuf fix. When package is empty, upb based protobuf fails with a "TypeError: Couldn't build proto file into descriptor pool: invalid name: empty part ()" during an attempt to add to the descriptor pool.

Apparently, at the moment it is not possible to use proto.Message to represent a BigQuery table if it has a nested field (STRUCT).

protobuf has since been fixed, so fork the project and change the line https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40

to

    package = getattr(
        proto_module, "package", module_name if module_name else ""
    )

And it will work.
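If forking proto-plus is not an option, another workaround (a sketch, not an official API: the strip_default_package helper is hypothetical, and the assumption that BigQuery accepts the rewritten relative names is untested here) is to post-process the descriptor after CopyToProto and strip the "._default_package." prefix that proto-plus injects:

```python
from google.protobuf.descriptor_pb2 import DescriptorProto

_PREFIX = "._default_package."

def strip_default_package(desc: DescriptorProto) -> None:
    """Rewrite type references like "._default_package.UserSchema.Team"
    into relative names ("UserSchema.Team"), recursing into nested types."""
    for field in desc.field:
        if field.type_name.startswith(_PREFIX):
            field.type_name = field.type_name[len(_PREFIX):]
    for nested in desc.nested_type:
        strip_default_package(nested)
```

In the original snippet this would be called right after `UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor)` and before wrapping the descriptor in `ProtoSchema`.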

The following module helps to bypass the pre-compilation or message class definition steps in proto-plus: https://pypi.org/project/xia-easy-proto/1.0.0/

You can just parse and transform a Python object to protobuf. Hope it helps.
