Left join with CoGroupByKey sink to BigQuery using Dataflow
I would like to join two files (expeditions-2010s.csv and peaks.csv) on the join key "peakid" with CoGroupByKey. However, when I sink the result to BigQuery, the load job fails with this error:

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_88_215864ba592a2e01f0c4e2157cc60c47_86e3562707f348c29b2a030cb6ed7ded failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
Please review the code below:
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def read_csv_pd_input1(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'bcdate', 'smtdate']]
    a = df.set_index('peakid')[['bcdate', 'smtdate']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    # result: only column name
    # a = df.agg(lambda x: (x.values)).apply(tuple)
    # result: only value but not as expected
    # a = [tuple(x) for x in df.values]
    # a = tuple(a)
    return a
def read_csv_pd_input3(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'pkname', 'heightm']]
    a = df.set_index('peakid')[['pkname', 'heightm']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    return a
def run(argv=None):
    import apache_beam as beam
    import io
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/expeditions-2010s.csv')
    parser.add_argument(
        '--input3',
        dest='input3',
        required=False,
        help='Input_p3 file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/peaks.csv')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    input_p1 = (
        p
        | 'Read From GCS input1' >> beam.Create([known_args.input])
        | 'Pair each employee with key p1' >> beam.FlatMap(read_csv_pd_input1)
        # | beam.Map(print)
    )
    input_p3 = (
        p
        | 'Read From GCS input3' >> beam.Create([known_args.input3])
        | 'Pair each employee with key p3' >> beam.FlatMap(read_csv_pd_input3)
    )

    # CoGroupByKey: relational join of two or more key/value PCollections.
    # It also accepts a dictionary of PCollections.
    output = (
        {'input_p1': input_p1, 'input_p3': input_p3}
        | 'Join' >> beam.CoGroupByKey()
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table='project_name:dataset.expeditions',
            schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
            method='FILE_LOADS',
            custom_gcs_temp_location='gs://bucket-name/input/temp',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )

    p.run().wait_until_finish()
    # runner = DataflowRunner()
    # runner.run_pipeline(p, options=options)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
This part of the pipeline is wrong:
| 'Join' >> beam.CoGroupByKey()
| 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(...
The output of CoGroupByKey will have the format key, {'input_p1': [list_of_p1_elems_with_key], 'input_p3': [list_of_p3_elems_with_key]}. You need to process that output to map it to the schema expected by the BigQuery sink. Because the schema of the data does not match the schema specified in the BigQuery sink, the ingestion of data fails.
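For illustration (with made-up values): since read_csv_pd_input1 emits (peakid, (bcdate, smtdate)) pairs and read_csv_pd_input3 emits (peakid, (pkname, heightm)) pairs, a single grouped element would look roughly like:

('AMAD', {'input_p1': [('2010-03-30', '2010-04-12')], 'input_p3': [('Ama Dablam', '6814')]})

That nested structure is what currently reaches WriteToBigQuery, and it is nothing like the flat {'peakid': ..., 'bcdate': ..., ...} row dictionaries the sink expects, hence the 'invalid' load error.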
The Beam programming guide has an example of how to process the output of CoGroupByKey, and the transform catalog has an example too.
I am not sure exactly how the columns of p1 and p3 are used to populate the BigQuery table. But other than that, after the beam.CoGroupByKey you could apply a beam.FlatMap (rather than beam.Map, since the function yields multiple rows per key) with a function similar to this one:
def process_group(kv):
    key, values = kv
    input_p1_list = values['input_p1']
    input_p3_list = values['input_p3']
    for p1 in input_p1_list:
        for p3 in input_p3_list:
            # The dict keys must match the field names in the sink schema
            # ('peakid', 'bcdate', ..., as declared in WriteToBigQuery).
            row_for_bq = {'peakid': key, 'bcdate': p1['something'], 'heightm': p3['something']}
            yield row_for_bq
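Wired into the question's pipeline, the join stage might then look like the sketch below (a minimal sketch reusing the question's own sink settings; process_group is the function above, and only the FlatMap step is new):

output = (
    {'input_p1': input_p1, 'input_p3': input_p3}
    | 'Join' >> beam.CoGroupByKey()
    # Flatten each (key, {'input_p1': [...], 'input_p3': [...]}) group
    # into one dict per output row, with keys matching the BigQuery schema.
    | 'Process Join Output' >> beam.FlatMap(process_group)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name:dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://bucket-name/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
)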