I have two CSV files and need to join them on the key "peakid". I have already transformed them like this:
expeditions- 2010s
peak
When I use CoGroupByKey, the result looks like this:
After that, when I write to BigQuery, I get this error:
BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_460_215864ba592a2e01f0c4e2157cc60c47_bc7734af2ebb4a53a0e268bbe6c40824 failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ece048e1a1ed41b987210a5c4b5e2c52/project-name.dataset.expeditions/cdcdbb44-2e25-4f4a-a792-34382d828244' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ece048e1a1ed41b987210a5c4b5e2c52/project-name.dataset.expeditions/cdcdbb44-2e25-4f4a-a792-34382d828244' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
Below is my code:
input_p1 = (
    p
    | 'Read From GCS input1' >> beam.Create([known_args.input1])
    | 'Parse csv file p1' >> beam.FlatMap(read_csv_file)
    | 'Tuple p1' >> beam.Map(lambda e: (e["peakid"], {'bcdate': [e["bcdate"]], 'smtdate': [e["smtdate"]]}))
)
input_p2 = (
    p
    | 'Read From GCS input2' >> beam.Create([known_args.input2])
    | 'Parse csv file p2' >> beam.FlatMap(read_csv_file)
    | 'Tuple p2' >> beam.Map(lambda e: (e["peakid"], {'pkname': [e["pkname"]], 'heightm': [e["heightm"]]}))
)
output = (
    (input_p1, input_p2)
    | 'Join' >> beam.CoGroupByKey()
    # | beam.Map(print)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project-name.dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://dtnhu_test_dataflow_v1/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
)
You should add a last Map operation before calling WriteToBigQuery. The final step must produce a PCollection of Dicts with the same structure as the output BigQuery table. Check the schema of your table and apply a last transformation that yields a Dict matching that structure. Right now your last step emits a tuple of tuples (the CoGroupByKey output), so you need to transform it into a Dict.
For example, if your BigQuery table has the following schema:
[
{
"name": "idTest",
"type": "STRING",
"mode": "NULLABLE",
"description": "Id"
},
{
"name": "nameTest",
"type": "BOOLEAN",
"mode": "NULLABLE",
"description": "name"
}
]
then your final transformation should return a Dict with this structure:
def to_element() -> Dict:
    return {
        'idTest': '22222222',
        'nameTest': True
    }
For your tuple, you can recover the key and value with the following code:
def test_with_your_tuple(self):
    res = ('ACHN', ([{'bcdate': [''], 'smtdate': ['9/25/15']},
                     {'bcdate': [''], 'smtdate': ['9/3/15']},
                     {'bcdate': [''], 'smtdate': ['']}],
                    [{'pkname': ['Aichyn'], 'heightm': ['6055']}]))
    key: str = res[0]
    value: List[Dict] = res[1][0]