
How to use CoGroupByKey sink to BigQuery in Apache Beam using Dataflow

I have two CSV files and need to join them on the key "peakid". I have already transformed them like this:
expeditions-2010s

  • ('TKRG', {'bcdate': ['3/6/10'], 'smtdate': ['3/12/10']})
  • ('AMPG', {'bcdate': ['4/5/10'], 'smtdate': ['']})
  • ('AMAD', {'bcdate': ['4/5/10'], 'smtdate': ['4/21/10']})
  • ('AMAD', {'bcdate': ['4/20/10'], 'smtdate': ['4/27/10']})
  • ('AMAD', {'bcdate': ['4/4/10'], 'smtdate': ['4/10/10']})
  • ...

peak

  • ('ACHN', {'pkname': ['Aichyn'], 'heightm': ['6055']})
  • ('AGLE', {'pkname': ['Agole East'], 'heightm': ['6675']})
  • ('AMAD', {'pkname': ['Ama Dablam'], 'heightm': ['6814']})
  • ('AMOT', {'pkname': ['Amotsang'], 'heightm': ['6393']})
  • ('AMPG', {'pkname': ['Amphu Gyabjen'], 'heightm': ['5630']})
  • ...

And when I use CoGroupByKey, the result looks like:

  • ('ACHN', ([{'bcdate': [''], 'smtdate': ['9/25/15']}, {'bcdate': [''], 'smtdate': ['9/3/15']}, {'bcdate': [''], 'smtdate': ['']}], [{'pkname': ['Aichyn'], 'heightm': ['6055']}]))
  • ('AGLE', ([], [{'pkname': ['Agole East'], 'heightm': ['6675']}]))
  • ('AMAD', ([{'bcdate': ['4/5/10'], 'smtdate': ['4/21/10']}, {'bcdate': ['4/20/10'], 'smtdate': ['4/27/10']}, {'bcdate': ['4/4/10'], 'smtdate': ['4/10/10']},...

After that I write the result to BigQuery, and the load job fails with this error:

    BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_460_215864ba592a2e01f0c4e2157cc60c47_bc7734af2ebb4a53a0e268bbe6c40824 failed.
    Error Result: <ErrorProto
      location: 'gs://bucket-name/input/temp/bq_load/ece048e1a1ed41b987210a5c4b5e2c52/project-name.dataset.expeditions/cdcdbb44-2e25-4f4a-a792-34382d828244'
      message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ece048e1a1ed41b987210a5c4b5e2c52/project-name.dataset.expeditions/cdcdbb44-2e25-4f4a-a792-34382d828244'
      reason: 'invalid'>
    [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']

Below is my code:

input_p1 = (
    p
    | 'Read From GCS input1' >> beam.Create([known_args.input1])
    | 'Parse csv file p1' >> beam.FlatMap(read_csv_file)
    | 'Tuple p1' >> beam.Map(lambda e: (e["peakid"], {'bcdate': [e["bcdate"]], 'smtdate': [e["smtdate"]]}))
)
input_p2 = (
    p
    | 'Read From GCS input2' >> beam.Create([known_args.input])
    | 'Parse csv file p2' >> beam.FlatMap(read_csv_file)
    | 'Tuple p2' >> beam.Map(lambda e: (e["peakid"], {'pkname': [e["pkname"]], 'heightm': [e["heightm"]]}))
)
output = (
    (input_p1, input_p2)
    | 'Join' >> beam.CoGroupByKey()
    # | beam.Map(print)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project-name.dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://dtnhu_test_dataflow_v1/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
)

You should add a last Map operation before calling WriteToBigQuery.

The step feeding WriteToBigQuery must produce a PCollection of Dicts with the same structure as the output BigQuery table.

Check the schema of your table and apply a final transformation so that each element becomes a Dict with that structure.

In your last step you have a tuple of tuples (the CoGroupByKey output); transform it into a Dict.

For example, if your BigQuery table has the following schema:

[
  {
    "name": "idTest",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Id"
  },
  {
    "name": "nameTest",
    "type": "BOOLEAN",
    "mode": "NULLABLE",
    "description": "name"
  }
]

Your final transformation should return a Dict with this structure:

from typing import Dict

def to_element() -> Dict:
    return {
        'idTest': '22222222',
        'nameTest': True  # the schema above declares nameTest as BOOLEAN
    }
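
Such a conversion sits in a Map right before the sink. A minimal wiring sketch (not from the original answer: `joined` stands for the output of CoGroupByKey, the table name is hypothetical, and the conversion function is assumed to take the joined element as an argument, unlike the zero-argument illustration above):

    # Sketch: 'joined' is the PCollection produced by CoGroupByKey; to_element
    # is assumed to take the joined element and return a Dict matching the schema.
    output = (
        joined
        | 'To row dict' >> beam.Map(to_element)
        | 'Write To BigQuery' >> beam.io.WriteToBigQuery(
            table='project-name.dataset.test',  # hypothetical table name
            schema='idTest:STRING,nameTest:BOOLEAN',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))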

For your tuple, you can recover the key and value with the following code:

    from typing import Dict, List

    def test_with_your_tuple(self):
        res = ('ACHN', ([{'bcdate': [''], 'smtdate': ['9/25/15']}, {'bcdate': [''], 'smtdate': ['9/3/15']},
                         {'bcdate': [''], 'smtdate': ['']}], [{'pkname': ['Aichyn'], 'heightm': ['6055']}]))

        key: str = res[0]  # the join key, e.g. 'ACHN'

        value: List[Dict] = res[1][0]  # the dicts from the first PCollection (expeditions)
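
Putting the pieces together for the table in the question (peakid, bcdate, pkname, heightm), here is a minimal sketch of that final transform. It is illustrative, not from the original post: it assumes one output row per expedition record, that heightm parses as an integer, and it converts bcdate from 'M/D/YY' to the 'YYYY-MM-DD' form that a BigQuery DATE column requires in a file load (the unconverted dates are a likely source of the 'invalid' row error above). The name to_bq_rows and the parsing rules are assumptions.

    from datetime import datetime
    from typing import Dict, Iterable, List, Tuple

    def to_bq_rows(element: Tuple[str, Tuple[List[Dict], List[Dict]]]) -> Iterable[Dict]:
        peakid, (expeditions, peaks) = element
        # Each side of the join wraps its fields in single-element lists.
        pkname = peaks[0]['pkname'][0] if peaks else None
        heightm = int(peaks[0]['heightm'][0]) if peaks and peaks[0]['heightm'][0] else None
        for exp in expeditions:
            raw = exp['bcdate'][0]
            # BigQuery DATE columns expect 'YYYY-MM-DD' in a JSON file load;
            # dates like '3/6/10' are reformatted, and empty strings become NULL.
            bcdate = datetime.strptime(raw, '%m/%d/%y').date().isoformat() if raw else None
            yield {'peakid': peakid, 'bcdate': bcdate, 'pkname': pkname, 'heightm': heightm}

    # Wired in with FlatMap, since one joined element can expand to several rows:
    # ... | 'Join' >> beam.CoGroupByKey()
    #     | 'To BQ rows' >> beam.FlatMap(to_bq_rows)
    #     | 'Write To BigQuery' >> ...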
