
How to Process my PubSub Message Object and Write all objects into BigQuery in Apache Beam using python?

I am trying to write all the elements of a Pub/Sub message (data, attributes, messageId and publish_time) to BigQuery using Apache Beam, and I want the data to look like this:

| data | attr | key | publishTime |
| ---- | ---- | --- | ----------- |
| data | attr | key | publishTime |

I am currently using the following piece of code to transform the message, and I want to save it in the table shown above:

( demo
      | "Decoding Pub/Sub Message " + input_subscription >> beam.Map(lambda r: data_decoder.decode_base64(r))
      | "Parsing Pub/Sub Message " + input_subscription >> beam.Map(lambda r: data_parser.parseJsonMessage(r))
      | "Write to BigQuery Table " + input_subscription >> io.WriteToBigQuery(
            '{0}:{1}'.format(project_name, dest_table_id),
            schema=schema,
            write_disposition=io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))

I want to store the data in its encoded form: a column named data holding the value of element.data, plus the values for the rest of the columns.

Thanks in advance!

I hope this helps.

Here is an example based on your use case; I mocked a PubsubMessage with Beam in a unit test:

import datetime
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline


def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        result = (p
                  | beam.Create([message])
                  | 'Map' >> beam.Map(self.to_bq_element))
        result | "Print outputs" >> beam.Map(log_element)

def to_bq_element(self, message: PubsubMessage):
    # Format the message's own publish_time; calling utcnow() here would
    # return the current time instead of the publish time.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }
  • I read a PubsubMessage
  • Map the PubsubMessage to a Dict in order to write the element to BigQuery
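The log_element helper used in the "Print outputs" step is not shown in the answer; a minimal sketch (the name and behavior are assumptions) would log each element and return it unchanged so the step can sit in the middle of a pipeline:

```python
import logging

def log_element(element):
    # Log the element and pass it through unmodified.
    logging.info(element)
    return element
```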

I get the following result:

{
   'data': b'{"test" : "value"}', 
   'attr': '{"label": "label"}', 
   'key': 'message33444', 
   'publishTime': '2022-10-07 09:37:33'
}

Don't hesitate to adapt this example to fit your needs.

The result Dict element for BigQuery in the Beam transformation must exactly match the schema of the BigQuery table.
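For instance, a schema string whose field names line up with the Dict keys above could look like the following sketch; the exact types are an assumption (data here is raw bytes, so you may prefer STRING if you decode it first):

```python
# Hypothetical schema matching the keys produced by to_bq_element.
schema = 'data:BYTES,attr:STRING,key:STRING,publishTime:TIMESTAMP'

# Each comma-separated entry is "name:type"; the names must match the Dict keys.
field_names = [field.split(':')[0] for field in schema.split(',')]
```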

The example with the BigQuery IO write:

def test_beam_pubsub_to_bq(self):
    with TestPipeline() as p:
        message = PubsubMessage(
            data=b'{"test" : "value"}',
            attributes={'label': 'label'},
            message_id='message33444',
            publish_time=datetime.datetime.now()
        )

        (p
         | beam.Create([message])
         | 'Map' >> beam.Map(self.to_bq_element)
         | "Print outputs" >> beam.Map(log_element)
         | "Write to BigQuery Table " + input_subscription >> io.WriteToBigQuery(
               '{0}:{1}'.format(project_name, dest_table_id),
               schema=schema,
               write_disposition=io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=io.BigQueryDisposition.CREATE_IF_NEEDED))

def to_bq_element(self, message: PubsubMessage):
    # Format the message's own publish_time; calling utcnow() here would
    # return the current time instead of the publish time.
    return {
        'data': message.data,
        'attr': json.dumps(message.attributes),
        'key': message.message_id,
        'publishTime': message.publish_time.strftime("%Y-%m-%d %H:%M:%S")
    }

