简体   繁体   中英

Apache Beam with Dataflow: flag 'ignore_unknown_columns' for WriteToBigQuery not working

I am building a streaming pipeline using Apache Beam (Python SDK version 2.37.0) and Google Dataflow to write some data I receive via Pubsub to BigQuery.

I process the data and end up with rows represented by a dictionary like this:

{'val1': 17.4, 'val2': 40.8, 'timestamp': 1650456507, 'NA_VAL': 'table_name'}

I then want to use WriteToBigQuery to insert the values into my table.

However, my table only has the columns val1 , val2 , and timestamp . Therefore, NA_VAL should be ignored. From how I understand the docs , this should be possible by setting ignore_unknown_columns=True .

However, when running the pipeline in Dataflow, I still receive an error and no values are inserted into the table:

There were errors inserting to BigQuery. Will not retry. Errors were [{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'NA_VAL', 'debugInfo': '', 'message': 'no such field: NA_VAL.'}]}]

I tried with a simple job configuration like this

rows | beam.io.WriteToBigQuery(
            table='PROJECT:DATASET.TABLE',
            ignore_unknown_columns=True)

as well as with those parameters

rows | beam.io.WriteToBigQuery(
            table='PROJECT:DATASET.TABLE',
            ignore_unknown_columns=True,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method='STREAMING_INSERTS',
            insert_retry_strategy='RETRY_NEVER')

Question : Am I missing something here that is preventing the pipeline from working? Does anyone have the same issue and/or a solution for this?

Unfortunately you have been bitten by a bug. This was reported as https://issues.apache.org/jira/browse/BEAM-14039 and fixed by https://github.com/apache/beam/pull/16999 . Version 2.38.0 will include this fix. Verification for that release just concluded today, so it should be available quite soon.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM