How to use nested proto.Message with BigQuery Storage API Writer Python client?
The snippet from https://github.com/googleapis/python-bigquery-storage/issues/398, which uses the proto-plus package to define a protobuf message in Python, is very helpful and works well as is, but it does not work for a nested message.
The adapted code below throws the following error when calling await bq_write_client.append_rows(iter([append_row_request])) if the message is nested:

google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.Team" is not defined.
PS: I know that the google-cloud-bigquery-storage library works with nested messages in general, because the official snippet https://github.com/googleapis/python-bigquery-storage/blob/main/samples/snippets/append_rows_proto2.py works and it uses a nested message. However, that message is defined in a separate .proto file, which needs a compilation step and is not as practical as defining the message directly in Python.

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import asyncio

import proto
from google.oauth2.service_account import Credentials
from google.protobuf.descriptor_pb2 import DescriptorProto
from google.cloud.bigquery_storage_v1beta2.types.storage import AppendRowsRequest
from google.cloud.bigquery_storage_v1beta2.types.protobuf import ProtoSchema, ProtoRows
from google.cloud.bigquery_storage_v1beta2.services.big_query_write import BigQueryWriteAsyncClient


class Team(proto.Message):
    name = proto.Field(proto.STRING, number=1)


class UserSchema(proto.Message):
    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    team = proto.Field(Team, number=3)


async def main():
    write_stream_path = BigQueryWriteAsyncClient.write_stream_path(
        "yolocommon", "test", "t_test_data", "_default")

    credentials = Credentials.from_service_account_file(filename="bigquery_config_file.json")
    bq_write_client = BigQueryWriteAsyncClient(credentials=credentials)

    # Describe the row type to BigQuery using the proto-plus message's descriptor.
    proto_descriptor = DescriptorProto()
    UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema = ProtoSchema(proto_descriptor=proto_descriptor)

    serialized_rows = []
    # The nested key must match the field name declared in UserSchema ("team").
    data = [
        {
            "username": "Jack",
            "email": "jack@google.com",
            "team": {
                "name": "Jack Jack"
            }
        },
        {
            "username": "mary",
            "email": "mary@google.com",
            "team": {
                "name": "Mary Mary"
            }
        }
    ]
    for item in data:
        instance = UserSchema.from_json(payload=json.dumps(item))
        serialized_rows.append(UserSchema.serialize(instance))

    proto_data = AppendRowsRequest.ProtoData(
        rows=ProtoRows(serialized_rows=serialized_rows),
        writer_schema=proto_schema
    )

    append_row_request = AppendRowsRequest(
        write_stream=write_stream_path,
        proto_rows=proto_data
    )

    result = await bq_write_client.append_rows(iter([append_row_request]))
    async for item in result:
        print(item)


if __name__ == "__main__":
    asyncio.run(main())

UPDATE: From ProtoSchema's documentation:

Descriptor for input message. The provided descriptor must be self contained, such that data rows sent can be fully decoded using only the single descriptor. For data rows that are compositions of multiple independent messages, this means the descriptor may need to be transformed to only use nested types: https://developers.google.com/protocol-buffers/docs/proto#nested

So the right way to write the message's description is:

class UserSchema(proto.Message):
    class Team(proto.Message):
        name = proto.Field(proto.STRING, number=1)

    username = proto.Field(proto.STRING, number=1)
    email = proto.Field(proto.STRING, number=2)
    team = proto.Field(Team, number=3)

But it still throws the same error:

google.api_core.exceptions.InvalidArgument: 400 Invalid proto schema: BqMessage.proto: Message.nested: "._default_package.UserSchema.Team" is not defined.
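
One way to see what the server is complaining about is to list the type references in a descriptor that the descriptor does not define itself. The following is only a minimal sketch: `unresolved_type_names` is a hypothetical helper, the descriptor is built by hand to mirror the nested UserSchema above rather than taken from proto-plus, and it assumes the receiver treats the descriptor as living in a file with no package (the error message suggests this, but that is an assumption).

```python
from typing import List

from google.protobuf.descriptor_pb2 import DescriptorProto, FieldDescriptorProto


def unresolved_type_names(root: DescriptorProto) -> List[str]:
    """Return fully qualified type references that `root` does not define.

    Assumption: `root` sits in a file with an empty package, so the root
    message itself is addressable as ".<name>".
    """
    defined = set()
    referenced = []

    def visit(message: DescriptorProto, scope: str) -> None:
        full_name = f"{scope}.{message.name}"
        defined.add(full_name)
        for enum in message.enum_type:
            defined.add(f"{full_name}.{enum.name}")
        for field in message.field:
            # Only fully qualified references (leading dot) are checked here.
            if field.type_name.startswith("."):
                referenced.append(field.type_name)
        for nested in message.nested_type:
            visit(nested, full_name)

    visit(root, "")
    return [name for name in referenced if name not in defined]


# Hand-built stand-in for the descriptor of the nested UserSchema.
user_schema = DescriptorProto(name="UserSchema")
user_schema.nested_type.add(name="Team")
user_schema.field.add(
    name="team",
    number=3,
    type=FieldDescriptorProto.TYPE_MESSAGE,
    type_name="._default_package.UserSchema.Team",
)

print(unresolved_type_names(user_schema))
```

The reference carries a package prefix that the descriptor alone cannot account for, which matches the "is not defined" part of the error.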

UPDATE2: The root of the issue is that proto-plus uses _default_package as the package name when no package name is set, because an empty package name causes another error. From https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40:

TODO: Revert to empty string as a package value after protobuf fix. When package is empty, upb based protobuf fails with an "TypeError: Couldn't build proto file into descriptor pool: invalid name: empty part ()' means" during an attempt to add to descriptor pool.

Apparently, at the moment it is not possible to use proto.Message to represent a BigQuery table that has a nested field (STRUCT).

protobuf has since been fixed, so fork the project and change the line at https://github.com/googleapis/proto-plus-python/blob/main/proto/_package_info.py#L40 to:

package = getattr(
    proto_module, "package", module_name if module_name else ""
)

And it will work.
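
If forking proto-plus is not an option, a possible alternative is to rewrite the descriptor before sending it, so that its type references no longer carry the package prefix. This is a minimal sketch under stated assumptions: `strip_package` is a hypothetical helper (not part of proto-plus or the BigQuery client), it relies only on `google.protobuf.descriptor_pb2`, it expects the dependent message to be nested inside the row message, and whether BigQuery accepts the rewritten descriptor has not been verified here.

```python
from google.protobuf.descriptor_pb2 import DescriptorProto, FieldDescriptorProto


def strip_package(descriptor: DescriptorProto, package: str) -> DescriptorProto:
    """Return a copy of `descriptor` whose type references drop `package`.

    Example: "._default_package.UserSchema.Team" becomes ".UserSchema.Team".
    The input descriptor is left unchanged.
    """
    prefix = f".{package}."
    result = DescriptorProto()
    result.CopyFrom(descriptor)

    def visit(message: DescriptorProto) -> None:
        for field in message.field:
            if field.type_name.startswith(prefix):
                field.type_name = "." + field.type_name[len(prefix):]
        # Nested messages can reference package-qualified types as well.
        for nested in message.nested_type:
            visit(nested)

    visit(result)
    return result


# Hand-built stand-in for what CopyToProto yields for the nested UserSchema.
original = DescriptorProto(name="UserSchema")
team = original.nested_type.add(name="Team")
team.field.add(name="name", number=1, type=FieldDescriptorProto.TYPE_STRING)
original.field.add(
    name="team",
    number=3,
    type=FieldDescriptorProto.TYPE_MESSAGE,
    type_name="._default_package.UserSchema.Team",
)

fixed = strip_package(original, "_default_package")
print(fixed.field[0].type_name)
```

In the question's code this would be applied to proto_descriptor right after UserSchema.pb().DESCRIPTOR.CopyToProto(proto_descriptor) and before building the ProtoSchema.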

The following module helps to bypass the pre-compile step or the message class definition in proto-plus: https://pypi.org/project/xia-easy-proto/1.0.0/

You can simply parse a Python object and transform it into protobuf. Hope this helps.