
TypeError when connecting to Google Cloud BigQuery from Apache Beam Dataflow in Python?

When I try to initialize the Python BigQuery Client() inside Apache Beam on Google Cloud Dataflow, it throws a TypeError:

TypeError('__init__() takes 2 positional arguments but 3 were given')

I am using Python 3.7 with Apache Beam on Dataflow, and I have to initialize the client and write to BigQuery manually instead of using a PTransform, because I want a dynamic table name passed in through runtime parameters.

I've tried passing the project and credentials to the client explicitly, but that doesn't seem to make a difference. If I use google-cloud-bigquery==1.11.2 instead of 1.13.0 it works fine, and 1.13.0 also works completely fine outside of Apache Beam.
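For now I'm working around it by pinning the older client in the requirements file staged to the Dataflow workers (with the standard --requirements_file pipeline option):

# requirements.txt, staged with --requirements_file
google-cloud-bigquery==1.11.2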

I have obviously cut out a bit of code, but this is essentially what throws the error:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, SetupOptions)
from google.cloud import bigquery


class SaveObjectsBigQuery(beam.DoFn):
    def process(self, element, *args, **kwargs):
        # Establish BigQuery client (`project` is defined in the code I cut out)
        client = bigquery.Client(project=project)


def run():
    pipeline_options = PipelineOptions()

    # GoogleCloud options object
    cloud_options = pipeline_options.view_as(GoogleCloudOptions)

    pipeline_options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=pipeline_options) as p:
        _data = (p
                 | "Create" >> beam.Create(["Start"])
                 )

        save_data_bigquery = _data | "Save to BigQuery" >> beam.ParDo(SaveObjectsBigQuery())

In earlier versions of google-cloud-bigquery this works fine, and I am able to create a table with the runtime parameter and call insert_rows_json without any problem. Using the WriteToBigQuery PTransform would obviously be ideal, but it isn't possible because the BigQuery tables need to be named dynamically.
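For reference, this is roughly what the working path on 1.11.2 looks like; a minimal sketch, assuming the table name arrives as a ValueProvider (a hypothetical "table" option), the project id and dataset are placeholders, and each element is a dict matching the id:STRING schema:

import apache_beam as beam
from google.api_core.exceptions import Conflict
from google.cloud import bigquery


class SaveObjectsBigQuery(beam.DoFn):
    def __init__(self, table_name):
        # table_name is a ValueProvider; .get() is only valid at run time
        self.table_name = table_name

    def process(self, element, *args, **kwargs):
        client = bigquery.Client(project="my-project")  # placeholder project id
        table_ref = client.dataset("campaign_contact").table(self.table_name.get())
        table = bigquery.Table(
            table_ref, schema=[bigquery.SchemaField("id", "STRING")])
        try:
            client.create_table(table)
        except Conflict:
            pass  # table already exists
        errors = client.insert_rows_json(table, [element])
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")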

EDIT:

I updated the code to try both a RuntimeValueProvider and a lambda function, but received a similar error for each:

AttributeError: 'function/RuntimeValueProvider' object has no attribute 'tableId'

I am essentially trying to use a RuntimeValueProvider when launching a Dataflow template to dynamically name a BigQuery table via the WriteToBigQuery PTransform.

# Attempt 1: passing the RuntimeValueProvider directly
save_data_bigquery = _data | WriteToBigQuery(
    project=project,
    dataset="campaign_contact",
    table=value_provider.RuntimeValueProvider(
        option_name="table", default_value=None, value_type=str),
    schema="id:STRING",
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=BigQueryDisposition.WRITE_APPEND
)

# Attempt 2: passing a lambda
save_data_bigquery = _data | WriteToBigQuery(
    table=lambda table: f"{project}:dataset.{runtime_options.table}",
    schema="id:STRING",
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=BigQueryDisposition.WRITE_APPEND
)

As of Beam 2.12, you can use the WriteToBigQuery transform to assign destinations dynamically. I'd recommend you try it out : )

Check out this test in the Beam codebase that shows an example of this.
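A minimal sketch of the dynamic-destination form (the project id, dataset and the "campaign" routing field below are placeholders; since the callable receives each element, whatever you route on has to be part of the row):

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery


def destination(element):
    # Invoked per element; returns a fully qualified "project:dataset.table"
    return f"my-project:campaign_contact.{element['campaign']}"


with beam.Pipeline() as p:
    _ = (
        p
        | "Create" >> beam.Create([
            {"id": "1", "campaign": "summer"},
            {"id": "2", "campaign": "winter"},
        ])
        | "Save to BigQuery" >> WriteToBigQuery(
            table=destination,  # callable evaluated for every element
            schema="id:STRING,campaign:STRING",
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
        )
    )

If the table name comes from a template parameter rather than from the data, the callable can ignore its element argument and call .get() on the ValueProvider inside the function body, since the callable only runs at execution time.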
