Apache Beam pipeline writing to multiple BQ tables

Question

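I have a scenario where I need to do the following:

Read data from Pub/Sub.
Apply multiple transformations to the data.
Save the PCollection to multiple Google BigQuery tables based on some configuration.

My question is how to write the data to multiple BigQuery tables.

I searched for ways to write to multiple BQ tables with Apache Beam but could not find a solution.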

Answer 1

You can do this with 3 sinks, for example with Beam Python:

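import logging

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Plain functions passed to beam.Map take only the element (no `self`).
def map1(element):
    ...

def map2(element):
    ...

def map3(element):
    ...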

def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

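    # YourOptions is a user-defined PipelineOptions subclass (not shown here).
    your_options = PipelineOptions().view_as(YourOptions)
    # Pub/Sub is an unbounded source, so the pipeline must run in streaming mode.
    pipeline_options = PipelineOptions(streaming=True)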

    with beam.Pipeline(options=pipeline_options) as p:

        result_pcollection = (
          p 
          | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription') 
          | 'Map 1' >> beam.Map(map1)
          | 'Map 2' >> beam.Map(map2)
          | 'Map 3' >> beam.Map(map3)
        )

        (result_pcollection |
         'Write to BQ table 1' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table1',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
         'Write to BQ table 2' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table2',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
         'Write to BQ table 3' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table3',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    main()

The first PCollection is the input read from Pub/Sub.
I apply 3 transformations to the input PCollection.
The result is written to 3 different BigQuery tables.

res = Flow 
=> Map 1
=> Map 2
=> Map 3

res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`

In this example I use STREAMING_INSERTS to ingest into the BigQuery tables, but you can adjust and change it as needed.
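If the list of tables comes from configuration, the three near-identical sinks can be generated in a loop. This is only a minimal sketch: `TABLES`, `table_spec` and `write_to_all` are hypothetical names that do not appear in the answer, and the project and dataset values are placeholders. It relies on `WriteToBigQuery` accepting a `project:dataset.table` string for `table`.

```python
from typing import List

# Hypothetical configuration; the table names are placeholders.
TABLES: List[str] = ["table1", "table2", "table3"]


def table_spec(project: str, dataset: str, table: str) -> str:
    """Build a 'project:dataset.table' string for WriteToBigQuery."""
    return f"{project}:{dataset}.{table}"


def write_to_all(pcoll, project: str, dataset: str, tables: List[str]) -> None:
    """Attach one BigQuery sink per configured table to the same PCollection."""
    # Imported here so the helper above can be used without Beam installed.
    import apache_beam as beam

    for table in tables:
        # Every transform label must be unique within a pipeline,
        # so the table name is included in the label.
        pcoll | f"Write to BQ {table}" >> beam.io.WriteToBigQuery(
            table=table_spec(project, dataset, table),
            method="STREAMING_INSERTS",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
```

Inside the `with beam.Pipeline(...)` block this would be called as `write_to_all(result_pcollection, 'project_id', 'dataset', TABLES)`.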

Answer 2

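I see that the previous answer meets your requirement of writing the same result to multiple tables. However, I am assuming the following scenario, which leads to a somewhat different pipeline.

Read data from PubSub
Filter the data based on configuration (from the event message keys)
Apply different/the same transformations to the filtered collections
Write the results of the previous collections to different BigQuery sinks

Here we filter events at an early stage of the pipeline, which helps to:

Avoid processing the same event message multiple times.
Skip messages you do not need.
Apply only the relevant transformations to each event message.
Keep the overall system efficient and cost-effective.

For example, you are processing messages from around the world and need to process and store data by geography: store European messages in a European region.

In addition, you need to apply transformations tied to country-specific data: add the Aadhar number to messages generated in India and the Social Security number to messages generated in the USA.

And you do not want to process/store any events from certain countries: data from Oceania countries is irrelevant and does not need to be processed/stored in our use case.

So, in this fictional example, filtering the data early (based on configuration) lets you store country-specific data (multiple sinks), and you do not have to run every event generated in the USA or any other region through the step that adds the Aadhar number (an event-specific transformation). You can skip/drop records, or simply store them in BigQuery without applying any transformation.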
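The routing decision described above can be kept in a small, configuration-driven function that is easy to test outside the pipeline. This is a sketch under assumptions: the `ROUTES` mapping, the country codes `au` and `nz`, and the `route` helper are hypothetical and only illustrate the idea. A value of `None` means "drop the message", and an unknown or missing country is sent to an `unprocessed` tag.

```python
from typing import Any, Dict, Optional

# Hypothetical routing configuration: country code -> output tag.
# None means the message should be dropped without being stored.
ROUTES: Dict[str, Optional[str]] = {
    "in": "india",
    "usa": "usa",
    "au": "oceania",
    "nz": None,
}


def route(element: Any, routes: Dict[str, Optional[str]] = ROUTES) -> Optional[str]:
    """Return the output tag for an element, None to drop it,
    or 'unprocessed' when the country is missing or unknown."""
    country = element.get("country") if isinstance(element, dict) else None
    if country not in routes:
        return "unprocessed"
    return routes[country]
```

A tagging DoFn can then call `route(element)` and emit the element on whichever tag it returns, so changing the configuration does not require touching the pipeline code.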

If the fictional example above resembles your scenario, a sample pipeline design might look like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions,...
from apache_beam.io.gcp.internal.clients import bigquery

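class TaggedData(beam.DoFn):
    def process(self, element):
        try:
            # filter here: emit each element on the output tag for its country
            if element["country"] == "in":
                yield beam.pvalue.TaggedOutput("india", element)
            elif element["country"] == "usa":
                yield beam.pvalue.TaggedOutput("usa", element)

            ...
        except Exception:
            # anything that cannot be classified goes to the dead-letter output
            yield beam.pvalue.TaggedOutput("unprocessed", element)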

def addAadhar(element):
    "Filtered messages - only India"
    yield "elementwithAadhar"

def addSSN(element):
    "Filtered messages - only USA"
    yield "elementwithSSN"

p = beam.Pipeline(options=options)
    
messages =  (
    p
    | "ReadFromPubSub" >> ...
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
    )

india_messages = (
    messages.india 
    | "AddAadhar" >> ...
    | "WriteIndiamsgToBQ" >> streaming inserts
    )

usa_messages = (
    messages.usa
    | "AddSSN" >> ...
    | "WriteUSAmsgToBQ" >> streaming inserts
    )

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniamsgToBQ" >> streaming inserts
    )

deadletter = (
    (messages.unprocessed, stage1.failed, stage2.failed)
    | "CombineAllFailed" >> Flatn...
    | "WriteUnprocessed/InvalidMessagesToBQ" >> streaminginserts...
)
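The dead-letter branch above can be sketched more concretely. The example below assumes the Pub/Sub payload is UTF-8 encoded JSON with a `country` key, which the answer does not state. `parse_message` and `split_and_merge_failures` are hypothetical helper names, and the failed-row layout (`raw`, `error`) is only one possible choice.

```python
import json
from typing import Any, Dict, Tuple


def parse_message(raw: bytes) -> Tuple[str, Dict[str, Any]]:
    """Decode a payload and return (tag, row); tag is 'ok' or 'unprocessed'."""
    try:
        row = json.loads(raw.decode("utf-8"))
        if not isinstance(row, dict) or "country" not in row:
            raise ValueError("missing 'country' key")
        return "ok", row
    except ValueError as err:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
        return "unprocessed", {
            "raw": raw.decode("utf-8", errors="replace"),
            "error": str(err),
        }


def split_and_merge_failures(messages, *other_failures):
    """Split parsed rows from failures, then merge all failures into one PCollection."""
    # Imported here so parse_message can be tested without Beam installed.
    import apache_beam as beam

    class Parse(beam.DoFn):
        def process(self, raw):
            tag, row = parse_message(raw)
            if tag == "ok":
                yield row  # main output
            else:
                yield beam.pvalue.TaggedOutput("unprocessed", row)

    parsed = messages | "Parse" >> beam.ParDo(Parse()).with_outputs(
        "unprocessed", main="ok"
    )
    # Flatten merges several PCollections of the same kind into one.
    dead_letter = (
        (parsed.unprocessed,) + tuple(other_failures)
        | "CombineAllFailed" >> beam.Flatten()
    )
    return parsed.ok, dead_letter
```

The returned `dead_letter` collection can be written to its own BigQuery table, so invalid messages stay available for inspection instead of failing the streaming job.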

Answer 1
1 2022-11-24 16:48:49

Answer 2
0 2022-11-25 08:49:34
