I have a scenario where I need to do the following:
My question is how can I write data to multiple big query tables.
I searched for multiple bq writes using apache beam but could not find any solution
You can do that with 3 sinks, example with Beam
Python
:
def map1(self, element):
...
def map2(self, element):
...
def map3(self, element):
...
def main() -> None:
logging.getLogger().setLevel(logging.INFO)
your_options = PipelineOptions().view_as(YourOptions)
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
result_pcollection = (
p
| 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
| 'Map 1' >> beam.Map(map1)
| 'Map 2' >> beam.Map(map2)
| 'Map 3' >> beam.Map(map3)
)
(result_pcollection |
'Write to BQ table 1' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table1',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
(result_pcollection |
'Write to BQ table 2' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table2',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
(result_pcollection_pub_sub |
'Write to BQ table 3' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table3',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
if __name__ == "__main__":
main()
PCollection
is the result of input from PubSub
.PCollection
Bigquery
tableres = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`
In this example I used STREAMING_INSERT
for ingestion to Bigquery
tables, but you can adapt and change it if needed in your case.
I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, I assume the below scenarios, provide a bit different pipeline.
Here, we filtered the events at early stages in the pipeline, this is helpful in:
For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing Europe messages in the Europe region.
Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhar number to messages generated from India and Social Security number to messages generated from the USA.
And you don't want to process/store any events from specific countries - data from oceanic countries are irrelevant and not required to process/stored in our use case.
So, in this made-up example, filtering the data (based on the config) at the early stage, you will be able to store country-specific data (multiple sinks), and you don't have to process all events generated from the USA/any other region for adding an Aadhar number (event specific transformations) and you will be able to skip/drop the records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions,...
from apache_beam.io.gcp.internal.clients import bigquery
class TaggedData(beam.DoFn):
def process(self, element):
try:
# filter here
if(element["country"] == "in")
yield {"indiaelements:taggedasindia"}
if(element["country"] == "usa")
yield {"usaelements:taggedasusa"}
...
except:
yield {"taggedasunprocessed"}
def addAadhar(element):
"Filtered messages - only India"
yield "elementwithAadhar"
def addSSN(element):
"Filtered messages - only USA"
yield "elementwithSSN"
p = beam.Pipeline(options=options)
messages = (
p
| "ReadFromPubSub" >> ...
| "Tagging >> "beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
)
india_messages = (
messages.india
| "AddAdhar" >> ...
| "WriteIndiamsgToBQ" >> streaming inserts
)
usa_messages = (
messages.usa
| "AddSSN" >> ...
| "WriteUSAmsgToBQ" >> streaming inserts
)
oceania_messages = (
messages.oceania
| "DoNothing&WriteUSAmsgToBQ" >> streaming inserts
)
deadletter = (
(messages.unprocessed, stage1.failed, stage2.failed)
| "CombineAllFailed" >> Flatn...
| "WriteUnprocessed/InvalidMessagesToBQ" >> streaminginserts...
)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.