
Apache Beam Pipeline Write to Multiple BQ tables

I have a scenario where I need to do the following:

  1. Read data from Pub/Sub.
  2. Apply multiple transformations to the data.
  3. Persist the PCollection to multiple Google BigQuery tables based on some config.

My question is: how can I write data to multiple BigQuery tables?

I searched for multiple BQ writes using Apache Beam but could not find a solution.

You can do that with three sinks. Example with Beam Python:

import logging

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


class YourOptions(PipelineOptions):
    """Placeholder for your pipeline's custom options."""


def map1(element):
    ...


def map2(element):
    ...


def map3(element):
    ...


def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

    # your_options holds any custom flags you define on YourOptions.
    your_options = PipelineOptions().view_as(YourOptions)
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:

        result_pcollection = (
          p 
          | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription') 
          | 'Map 1' >> beam.Map(map1)
          | 'Map 2' >> beam.Map(map2)
          | 'Map 3' >> beam.Map(map3)
        )

        (result_pcollection |
         'Write to BQ table 1' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table1',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
         'Write to BQ table 2' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table2',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
         'Write to BQ table 3' >> beam.io.WriteToBigQuery(
                    project='project_id',
                    dataset='dataset',
                    table='table3',
                    method='STREAMING_INSERTS',
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    main()

  • The first PCollection is the result of the input read from Pub/Sub.
  • Three transformations are applied to the input PCollection.
  • The result is sunk to three different BigQuery tables.
res = Flow 
=> Map 1
=> Map 2
=> Map 3

res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`

In this example I used STREAMING_INSERTS for ingestion to the BigQuery tables, but you can adapt and change the write method if needed in your case.
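For instance, here is a minimal sketch (with the same placeholder project/dataset/table names as above) of one sink switched to batch file loads, which trades latency for lower cost; with FILE_LOADS a streaming pipeline must set triggering_frequency:

(result_pcollection |
 'Write to BQ table 1 with file loads' >> beam.io.WriteToBigQuery(
            project='project_id',
            dataset='dataset',
            table='table1',
            method='FILE_LOADS',
            # Seconds between load jobs; required for FILE_LOADS in streaming pipelines.
            triggering_frequency=300,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))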

I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, assuming the scenario below, I provide a slightly different pipeline.

  • Read data from Pub/Sub.
  • Filter the data based on configs (from event message keys).
  • Apply different/same transformations to the filtered collections.
  • Write the results from the previous collections to different BigQuery sinks.

Here, we filter the events at an early stage in the pipeline, which helps to:

  • Avoid processing the same event messages multiple times.
  • Skip the messages that are not needed.
  • Apply only the relevant transformations to each event message.
  • Build an overall efficient and cost-effective system.

For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing European messages in the Europe region.

Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhaar number to messages generated from India and a Social Security number to messages generated from the USA.

And you don't want to process/store events from specific countries - in our use case, data from Oceanian countries is irrelevant and does not need to be processed/stored.

So, in this made-up example, by filtering the data (based on the config) at an early stage, you can store country-specific data in multiple sinks, you don't have to run the Aadhaar transformation on events generated from the USA or any other region (event-specific transformations), and you can skip/drop the records or simply store them in BigQuery without applying any transformations.

If the above made-up example resembles your scenario, the sample pipeline design may look like this:

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


class TaggedData(beam.DoFn):
    """Routes each event to a country-specific tagged output (the filter step)."""

    def process(self, element):
        try:
            if element["country"] == "in":
                yield pvalue.TaggedOutput("india", element)
            elif element["country"] == "usa":
                yield pvalue.TaggedOutput("usa", element)
            elif element["country"] in ("au", "nz"):
                yield pvalue.TaggedOutput("oceania", element)
            # ... more countries/configs here
        except Exception:
            yield pvalue.TaggedOutput("unprocessed", element)


def add_aadhaar(element):
    """Filtered messages - only India."""
    element["aadhaar"] = "..."
    return element


def add_ssn(element):
    """Filtered messages - only USA."""
    element["ssn"] = "..."
    return element


options = PipelineOptions(streaming=True)
p = beam.Pipeline(options=options)

messages = (
    p
    | "ReadFromPubSub" >> ReadFromPubSub(subscription="...")
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs(
        "india", "usa", "oceania", "unprocessed")
)

india_messages = (
    messages.india
    | "AddAadhaar" >> beam.Map(add_aadhaar)
    | "WriteIndiaMsgToBQ" >> beam.io.WriteToBigQuery(
        table="project_id:dataset.india_table",
        method="STREAMING_INSERTS",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

usa_messages = (
    messages.usa
    | "AddSSN" >> beam.Map(add_ssn)
    | "WriteUSAMsgToBQ" >> beam.io.WriteToBigQuery(
        table="project_id:dataset.usa_table",
        method="STREAMING_INSERTS",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniaMsgToBQ" >> beam.io.WriteToBigQuery(
        table="project_id:dataset.oceania_table",
        method="STREAMING_INSERTS",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

# If the transformation stages also emit failure-tagged outputs, flatten them
# in here together with the messages that could not be tagged.
deadletter = (
    (messages.unprocessed,)
    | "CombineAllFailed" >> beam.Flatten()
    | "WriteUnprocessedMessagesToBQ" >> beam.io.WriteToBigQuery(
        table="project_id:dataset.deadletter_table",
        method="STREAMING_INSERTS",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

p.run()
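One more note, as a minimal sketch rather than something from the answers above: WriteToBigQuery also accepts a callable for table, so a single sink can route each element to a different destination table based on the element itself (all table names here are placeholders):

def route_to_table(element):
    """Hypothetical routing function: pick the destination table per element."""
    return f"project_id:dataset.events_{element['country']}"

# 'events' stands for any PCollection of dicts that carry a 'country' key.
(events
 | "WriteToPerCountryTables" >> beam.io.WriteToBigQuery(
     table=route_to_table,
     method="STREAMING_INSERTS",
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))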
