
What to return from an Apache Beam PCollection to write to BigQuery

I am reading the Beam documentation and some Stack Overflow questions/answers in order to understand how I would write a Pub/Sub message to BigQuery. As of now, I have a working example of receiving protobuf messages and am able to decode them. The code looks like this:

(p
 | 'ReadData' >> apache_beam.io.ReadFromPubSub(topic=known_args.input_topic, with_attributes=True)
 | 'ParsePubsubMessage' >> apache_beam.Map(parse_pubsubmessage)
 )

Eventually, what I want to do is write the decoded Pub/Sub message to BigQuery. All attributes (and the decoded byte data) will have a one-to-one column mapping.

What is confusing me is what my parse_pubsubmessage should return. As of now, it returns an instance of a custom class which holds all the fields, i.e.:

class DecodedPubsubMessage:
    def __init__(self, attr, event):
        # Pub/Sub message attributes
        self.attribute_one = attr['attribute_one']
        self.attribute_two = attr['attribute_two']

        # Fields from the decoded protobuf event
        self.order_id = event.order.order_id
        self.sku = event.item.item_id
        self.triggered_at = event.timestamp
        self.status = event.order.status

Is this the correct approach for this Dataflow job? My thinking was that I would use this returned value to write to BigQuery, but because of the advanced Python features involved, I am unable to work out how. Here is a reference example that I was looking at. From that example, I am not sure how I would apply the lambda map to my returned object in order to write it to BigQuery.

Your class must inherit from DoFn and override the process method, rather than doing the transformation in __init__.

After the transformation, you can use return [obj] or yield obj to emit the elements of the desired output PCollection, as in the sketch below.
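
For reference, here is a minimal sketch of that approach, based on the code in the question. The decode_event helper, the table spec, and the schema string (including the column types) are placeholders you would replace with your own protobuf decoding and table details:

import apache_beam


class ParsePubsubMessage(apache_beam.DoFn):
    def process(self, message):
        # With with_attributes=True, ReadFromPubSub emits PubsubMessage
        # objects carrying both the payload bytes and the attributes.
        attr = message.attributes
        event = decode_event(message.data)  # placeholder: your protobuf decoding

        # Yield a dict whose keys match the BigQuery column names;
        # WriteToBigQuery consumes dicts rather than custom classes.
        yield {
            'attribute_one': attr['attribute_one'],
            'attribute_two': attr['attribute_two'],
            'order_id': event.order.order_id,
            'sku': event.item.item_id,
            'triggered_at': event.timestamp,
            'status': event.order.status,
        }


(p
 | 'ReadData' >> apache_beam.io.ReadFromPubSub(topic=known_args.input_topic, with_attributes=True)
 | 'ParsePubsubMessage' >> apache_beam.ParDo(ParsePubsubMessage())
 | 'WriteToBigQuery' >> apache_beam.io.WriteToBigQuery(
       'my-project:my_dataset.my_table',  # placeholder table spec
       schema=('attribute_one:STRING,attribute_two:STRING,order_id:STRING,'
               'sku:STRING,triggered_at:TIMESTAMP,status:STRING'),
       write_disposition=apache_beam.io.BigQueryDisposition.WRITE_APPEND)
 )

Yielding plain dicts keyed by column name is what lets WriteToBigQuery perform the one-to-one column mapping you described.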
