
Issue in a Dataflow pipeline that writes to multiple tables in BigQuery

I am using the code from https://partly-cloudy.co.uk/2021/02/05/dataflow-pipeline-to-ingest-into-multiple-bigquery-tables-using-dynamic-destination-and-side-input/

and changed it to:

def getFullTableName(pn,tn):
    return "{0}:{1}".format(pn,tn)
...
(
  pipeline | "Read Data From Input Topic" >> beam.io.ReadFromPubSub(topic=data_topic)
           | "Get Table data from input row" >> beam.Map(lambda r : data_ingestion.getData(r))
           | "Write to BigQuery Table" >> beam.io.WriteToBigQuery(table = lambda project_name, dest_table_id : getFullTableName(project_name,dest_table_id),                                                                            schema = lambda table, schema_coll : schema_coll[table],                                                                            schema_side_inputs=(schema_coll,),                                                                             create_disposition='CREATE_IF_NEEDED',                                                                           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

When running this I get the following error:

Error message from worker: generic::unknown: Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 843, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery.py", line 1658, in process
    schema = self.schema(destination, *schema_side_inputs)
  File "multi_table_stream_dyndest.py", line 62, in <lambda>
KeyError: 'None:ticketing.test2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 267, in _execute
    response = task()
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 340, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 580, in do_instruction
    return getattr(self, request_type)(
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 618, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 995, in process_bundle
    input_op_by_transform_id[element.transform_id].process_encoded(
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 221, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 346, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 348, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 707, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 708, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1200, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1281, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 843, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery.py", line 1658, in process
    schema = self.schema(destination, *schema_side_inputs)
  File "multi_table_stream_dyndest.py", line 62, in <lambda>
KeyError: "None:ticketing.test2 [while running 'Write to BigQuery Table/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)-ptransform-671']"

Can someone help me?

Your function is returning a table name that is wrong, as you can see here:

KeyError: "None:ticketing.test2"

It means the variable project_name is empty (None). You have to set it beforehand, or get it from the data you are sending. In the example you are following, the data has a tablename field:

{"tablename": "data-analytics-bk:da_belgium_dataset.cust_data", ...

I added this function:

def getFullTableName(self):
    logging.info('>>> tablename= %s', "{0}:{1}".format(parseInputData.project, parseInputData.tablename))
    return "{0}:{1}".format(parseInputData.project, parseInputData.tablename)

and changed the WriteToBigQuery call:

...
table = lambda l: data_ingestion.getFullTableName()

This fixed the problem.
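
Put together, a sketch of how that fix hangs together, assuming parseInputData is the class that parses each incoming message and fills in project and tablename as class attributes (names taken from the snippets above; the parsing logic itself is not shown in the question):

import logging

class parseInputData:
    # Assumption: these are filled in while parsing each Pub/Sub
    # message, e.g. inside getData().
    project = None
    tablename = None

    def getFullTableName(self):
        full_name = "{0}:{1}".format(parseInputData.project, parseInputData.tablename)
        logging.info('>>> tablename= %s', full_name)
        return full_name

data_ingestion = parseInputData()

# used in the write step as:
#   table=lambda l: data_ingestion.getFullTableName()

A per-element destination (as in the earlier sketch) avoids shared mutable state, which may be safer when a worker processes elements for different tables in parallel.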
