How to process my Pub/Sub message object and write all objects into BigQuery in Apache Beam using Python?
I am trying to write all the elements of a Pub/Sub message (data, attributes, messageId and publish_time) to BigQuery using Apache Beam, and I want the data to look like this:
| data | attr | key | publishTime |
|---|---|---|---|
| data | attr | key | publishTime |
I am currently using the following piece of code to transform the message, and I want to save it into the table shown above:
( demo
    | "Decoding Pub/Sub Message" + input_subscription >> beam.Map(lambda r: data_decoder.decode_base64(r))
    | "Parsing Pub/Sub Message" + input_subscription >> beam.Map(lambda r: data_parser.parseJsonMessage(r))
    | "Write to BigQuery Table" + input_subscription >> io.WriteToBigQuery(
        '{0}:{1}'.format(project_name, dest_table_id),
        schema=schema,
        write_disposition=io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))
I want to store the data in its encoded form, in a column named data holding the value of element.data, together with the values for the rest of the columns.
Thanks in advance!
I hope it can help.
I give an example based on your use case: I mocked a PubsubMessage with Beam in a unit test:
import datetime
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline


def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        # Mock a Pub/Sub message with data, attributes, message_id and publish_time.
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        result = (p
                  | beam.Create([message])
                  | 'Map' >> beam.Map(self.to_bq_element))

        result | "Print outputs" >> beam.Map(log_element)


def to_bq_element(self, message: PubsubMessage):
    # Turn the PubsubMessage into a Dict whose keys match the BigQuery columns.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }
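log_element is not defined in the snippet above; it is just a small helper that logs each row and passes it through. A minimal version (my assumption, not part of the original) could be:

import logging

def log_element(element):
    # Log the row and return it unchanged so the step can sit in the middle of a pipeline.
    logging.info(element)
    return element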
I map the PubsubMessage to a Dict in order to write the element to BigQuery.
I have the following result:
{
    'data': b'{"test" : "value"}',
    'attr': '{"label": "label"}',
    'key': 'message33444',
    'publishTime': '2022-10-07 09:37:33'
}
- For data I used Python bytes; I think the matching type in BigQuery is BYTES: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
- For attr I used a JSON String (I initially had a Dict)
- key is a String
- publishTime is a String with the timestamp in ISO format
Don't hesitate to adapt this example to fit your needs.
The result Dict element produced by the Beam transformation must exactly match the schema of the BigQuery table.
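For example, a schema matching the Dict returned by to_bq_element could be passed to WriteToBigQuery as a simple field:type string (a sketch; the types, in particular STRING for publishTime, are assumptions you can adjust, e.g. to TIMESTAMP):

# Assumed schema matching the keys returned by to_bq_element.
schema = 'data:BYTES,attr:STRING,key:STRING,publishTime:STRING'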
The example with the BigQueryIO write:
def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        # Mock a Pub/Sub message with data, attributes, message_id and publish_time.
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        # input_subscription, project_name, dest_table_id and schema are the same
        # variables as in your original snippet.
        (p
         | beam.Create([message])
         | 'Map' >> beam.Map(self.to_bq_element)
         | "Print outputs" >> beam.Map(log_element)
         | "Write to BigQuery Table" + input_subscription >> io.WriteToBigQuery(
             '{0}:{1}'.format(project_name, dest_table_id),
             schema=schema,
             write_disposition=io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))


def to_bq_element(self, message: PubsubMessage):
    # Turn the PubsubMessage into a Dict whose keys match the BigQuery columns.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }
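In the real streaming pipeline (outside of the unit test) you would replace beam.Create([message]) with a Pub/Sub read. A minimal sketch, assuming the same placeholders as above (input_subscription, project_name, dest_table_id, schema) and a module-level to_bq_element; with_attributes=True makes ReadFromPubSub emit PubsubMessage objects (data, attributes, message_id, publish_time) instead of raw bytes:

import apache_beam as beam
from apache_beam import io
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Pub/Sub sources require a streaming pipeline.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription,
                                                    with_attributes=True)
         | "To BigQuery row" >> beam.Map(to_bq_element)
         | "Write to BigQuery Table" >> io.WriteToBigQuery(
             '{0}:{1}'.format(project_name, dest_table_id),
             schema=schema,
             write_disposition=io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))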