Dynamically set bigquery table id in dataflow pipeline

I have a dataflow pipeline written in Python, and this is what it does:

  1. Read messages from Pub/Sub. Messages are zipped protocol buffers. One message received on Pub/Sub contains multiple types of messages. See the parent protocol message specification below:

message BatchEntryPoint {
  /**
   * EntryPoint
   *
   * Description: Encapsulation message
   */
  message EntryPoint {
    // Proto Message
    google.protobuf.Any proto = 1;

    // Timestamp
    google.protobuf.Timestamp timestamp = 4;
  }

  // Array of EntryPoint messages
  repeated EntryPoint entrypoints = 1;
}

So, to explain a bit better: I have several protobuf messages. Each message must be packed in the proto field of an EntryPoint message. We send several messages at once because of MQTT limitations, which is why BatchEntryPoint holds a repeated field of EntryPoint messages.

  2. Parse the received messages.

Nothing fancy here, just unzipping and deserializing the message we just read from Pub/Sub to get 'human readable' data (a simplified sketch of this step and the next follows the list below).

  3. Loop over the BatchEntryPoint to evaluate each EntryPoint message.

As each message in a BatchEntryPoint can have a different type, we need to process them differently.

  4. Parse the message data.

Doing different processing to get all the information I need and formatting it into a BigQuery-readable format.

  5. Write data to BigQuery.
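
To give an idea, here is a simplified sketch of steps 2 and 3; the compression scheme (zlib here) and the generated module names batch_entrypoint_pb2 / system_pb2 are placeholders, not my exact code:

import zlib

import batch_entrypoint_pb2  # placeholder: module generated from the BatchEntryPoint .proto
import system_pb2            # placeholder: generated module for the System message


def parse_batch(payload: bytes):
    """ Unzip a Pub/Sub payload and yield the unpacked EntryPoint protos """
    raw = zlib.decompress(payload)
    batch = batch_entrypoint_pb2.BatchEntryPoint()
    batch.ParseFromString(raw)
    for entrypoint in batch.entrypoints:
        # google.protobuf.Any carries the packed type name, so each message
        # type can be routed to its own handler
        if entrypoint.proto.Is(system_pb2.System.DESCRIPTOR):
            message = system_pb2.System()
            entrypoint.proto.Unpack(message)
            yield message, entrypoint.timestamp
        # ... other message types are unpacked the same way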

This is where my 'trouble' begins (step 5 above): my code works, but in my opinion it is very dirty and hardly maintainable. There are two things to be aware of.
Each message type can be sent to 3 different datasets: an r&d dataset, a dev dataset and a production dataset. Let's say I have a message named System. It could go to:

  • my-project:rd_dataset.system
  • my-project:dev_dataset.system
  • my-project:prod_dataset.system

So this is what I am doing now:

console_records | 'Write to Console BQ' >> beam.io.WriteToBigQuery(
    lambda e: 'my-project:rd_dataset.table1' if dataset_is_rd_table1(e) else (
        'my-project:dev_dataset.table1' if dataset_is_dev_table1(e) else (
        'my-project:prod_dataset.table1' if dataset_is_prod_table1(e) else (
        'my-project:rd_dataset.table2' if dataset_is_rd_table2(e) else (
        'my-project:dev_dataset.table2' if dataset_is_dev_table2(e) else (
        ...) else 0

I have more than 30 different types of messages, which makes for more than 90 lines of code just to insert data into BigQuery.

Here is what a dataset_is_..._tableX method looks like:

def dataset_is_rd_messagestype(element) -> bool:
    """ check if env is rd for message's type message """
    valid: bool = False
    is_type = check_element_type(element, 'MessagesType')
    if is_type:
        valid = dataset_is_rd(element)
    return valid

check_element_type checks that the message has the right type (e.g. System).
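
Something along these lines (a simplified sketch, assuming the type marker lives in the element's bq_type key):

def check_element_type(element, message_type: str) -> bool:
    """ Simplified sketch: compare the element's type marker with the expected type """
    return element.get('bq_type') == message_type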
dataset_is_rd looks like this:

def dataset_is_rd(element) -> bool:
    """ Check if dataset should be RD from registry id """
    if element['device_registry_id'] == 'rd':
        del element['device_registry_id']
        del element['bq_type']
        return True
    return False

The element has a key that tells us to which dataset we must send the message.

So this is working as expected, but I wish I could write cleaner code and maybe reduce the amount of code that needs to change when adding or deleting a type of message.

Any ideas?

How about using TaggedOutput?

Can you write something like this instead:

def dataset_type(element) -> str:
    """ Build the destination table name from the registry id and message type """
    dev_registry = element['device_registry_id']
    del element['device_registry_id']
    del element['bq_type']
    table_type = get_element_type(element, 'MessagesType')
    return 'my-project:%s_dataset.table%d' % (dev_registry, table_type)

And use that as the table lambda that you pass to BQ?
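
For example, something like this (a sketch; WriteToBigQuery accepts a callable for its table argument, and the records PCollection and the dispositions here are just placeholders):

records | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
    table=dataset_type,  # called with each element, returns the destination table
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
)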

So I managed to create code that inserts data into dynamic tables by crafting the table name dynamically.

This is not perfect because I have to modify the element I pass to the method, but I am still very happy with the result; it has cleaned hundreds of lines out of my code. If I have a new table, adding it takes one line in an array, compared to 6 lines in the pipeline before.

Here is my solution:

def batch_pipeline(pipeline):
    console_message = (
        pipeline
        | 'Get console\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub1',
            with_attributes=True)
    )
    common_message = (
        pipeline
        | 'Get common\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub2',
            with_attributes=True)
    )
    jetson_message = (
        pipeline
        | 'Get jetson\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub3',
            with_attributes=True)
    )

    # merge the three sources, then unzip/deserialize the messages
    message = (console_message, common_message, jetson_message) | beam.Flatten()
    clear_message = message | beam.ParDo(GetClearMessage())

    # the raw bytes go to a back up table
    console_bytes = clear_message | beam.ParDo(SetBytesData())
    console_bytes | 'Write to big query back up table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_backup(e)
    )

    records = clear_message | beam.ParDo(GetProtoData())
    gps_records = clear_message | 'Get GPS Data' >> beam.ParDo(GetProtoData())
    parsed_gps = gps_records | 'Parse GPS Data' >> beam.ParDo(ParseGps())
    if parsed_gps:
        parsed_gps | 'Write to big query gps table' >> beam.io.WriteToBigQuery(
            lambda e: write_gps(e)
        )
    records | 'Write to big query table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_bq(e)
    )

So the pipeline is reading from 3 different Pub/Sub subscriptions, extracting the data and writing it to BigQuery.

The structure of an element used by WriteToBigQuery looks like this:

  obj = {
        'data': data_to_write_on_bq,
        'registry_id': data_needed_to_craft_table_name,
        'gcloud_id': data_to_write_on_bq,
        'proto_type': data_needed_to_craft_table_name
  }

and then one of the methods used in the lambda passed to WriteToBigQuery looks like this:

from copy import copy
from functools import reduce
import logging


def write_to_bq(e):
    logging.info(e)
    element = copy(e)
    registry = element['registry_id']
    logging.info(registry)
    # set dataset name from the registry, i.e. the environment (dev/prod/rd/...)
    dataset = set_dataset(registry)
    proto_type = element['proto_type']
    logging.info('Proto Type %s', proto_type)
    # turn the CamelCase proto type into a snake_case table name
    table_name = reduce(lambda x, y: x + ('_' if y.isupper() else '') + y, proto_type).lower()
    full_table_name = f'my_project:{dataset}.{table_name}'
    logging.info(full_table_name)
    # remove the routing keys from the element itself so they are not written to BigQuery
    del e['registry_id']
    del e['proto_type']

    return full_table_name
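
set_dataset (not shown above) derives the dataset name from the registry id; a minimal sketch of what it could look like (the mapping values here are illustrative):

# illustrative mapping from device registry id to dataset name
DATASETS = {
    'rd': 'rd_dataset',
    'dev': 'dev_dataset',
    'prod': 'prod_dataset',
}


def set_dataset(registry_id: str) -> str:
    """ Map a device registry id to its BigQuery dataset name (sketch) """
    return DATASETS[registry_id]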

And that's it, after 3 days of trouble!!
