How to process my Pub/Sub message object and write all objects into BigQuery in Apache Beam using Python?
I am trying to write all the elements of a Pub/Sub message (data, attributes, messageId and publish_time) to BigQuery using Apache Beam, and I want the data to look like this:
| data | attr | key | publishTime |
|---|---|---|---|
| data | attr | key | publishTime |
I am currently using the following piece of code to transform the message, and I want to save it into the table shown above:
( demo
    | "Decoding Pub/Sub Message" + input_subscription >> beam.Map(lambda r: data_decoder.decode_base64(r))
    | "Parsing Pub/Sub Message" + input_subscription >> beam.Map(lambda r: data_parser.parseJsonMessage(r))
    | "Write to BigQuery Table" + input_subscription >> io.WriteToBigQuery(
        '{0}:{1}'.format(project_name, dest_table_id),
        schema=schema,
        write_disposition=io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))
I want to store the data in its encoded form, in a column named data holding the value of element.data, together with the values for the rest of the columns.
Thanks in advance!
I hope it can help.
I give an example based on your use case: I mocked a PubsubMessage with Beam in a unit test:
import datetime
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline


def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        # Mock a Pub/Sub message with data, attributes, message_id and publish_time.
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        result = (p
                  | beam.Create([message])
                  | 'Map' >> beam.Map(self.to_bq_element))

        result | "Print outputs" >> beam.Map(log_element)


def to_bq_element(self, message: PubsubMessage):
    # Turn the PubsubMessage into a Dict whose keys match the BigQuery columns.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }
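log_element is not defined in the snippet above; it is just a small helper that logs each row and passes it through. A minimal version (my assumption, not part of the original) could be:

import logging

def log_element(element):
    # Log the row and return it unchanged so the step can sit in the middle of a pipeline.
    logging.info(element)
    return element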
I map the PubsubMessage to a Dict in order to write the element to BigQuery.
I have the following result:
{
    'data': b'{"test" : "value"}',
    'attr': '{"label": "label"}',
    'key': 'message33444',
    'publishTime': '2022-10-07 09:37:33'
}
- For data I used Python bytes; I think the matching type in BigQuery is BYTES: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
- For attr I used a JSON String (I initially had a Dict)
- key is a String
- publishTime is a String with the timestamp in ISO format
Don't hesitate to adapt this example to fit your needs.
The result Dict element produced by the Beam transformation must exactly match the schema of the BigQuery table.
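For example, a schema matching the Dict returned by to_bq_element could be passed to WriteToBigQuery as a simple field:type string (a sketch; the types, in particular STRING for publishTime, are assumptions you can adjust, e.g. to TIMESTAMP):

# Assumed schema matching the keys returned by to_bq_element.
schema = 'data:BYTES,attr:STRING,key:STRING,publishTime:STRING'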
The example with the BigQueryIO write:
def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        # Mock a Pub/Sub message with data, attributes, message_id and publish_time.
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        # input_subscription, project_name, dest_table_id and schema are the same
        # variables as in your original snippet.
        (p
         | beam.Create([message])
         | 'Map' >> beam.Map(self.to_bq_element)
         | "Print outputs" >> beam.Map(log_element)
         | "Write to BigQuery Table" + input_subscription >> io.WriteToBigQuery(
             '{0}:{1}'.format(project_name, dest_table_id),
             schema=schema,
             write_disposition=io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))


def to_bq_element(self, message: PubsubMessage):
    # Turn the PubsubMessage into a Dict whose keys match the BigQuery columns.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }
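In the real streaming pipeline (outside of the unit test) you would replace beam.Create([message]) with a Pub/Sub read. A minimal sketch, assuming the same placeholders as above (input_subscription, project_name, dest_table_id, schema) and a module-level to_bq_element; with_attributes=True makes ReadFromPubSub emit PubsubMessage objects (data, attributes, message_id, publish_time) instead of raw bytes:

import apache_beam as beam
from apache_beam import io
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Pub/Sub sources require a streaming pipeline.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription,
                                                    with_attributes=True)
         | "To BigQuery row" >> beam.Map(to_bq_element)
         | "Write to BigQuery Table" >> io.WriteToBigQuery(
             '{0}:{1}'.format(project_name, dest_table_id),
             schema=schema,
             write_disposition=io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))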