Left join with CoGroupByKey sink to BigQuery using Dataflow
I would like to join two files (expeditions-2010s.csv and peaks.csv) on the join key "peakid" with CoGroupByKey. However, when I sink the result to BigQuery, the load job fails with this error:

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_88_215864ba592a2e01f0c4e2157cc60c47_86e3562707f348c29b2a030cb6ed7ded failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
Please review the code below:
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def read_csv_pd_input1(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'bcdate', 'smtdate']]
    a = df.set_index('peakid')[['bcdate', 'smtdate']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    # result: only column name
    # a = df.agg(lambda x: (x.values)).apply(tuple)
    # result: only value but not as expected
    # a = [tuple(x) for x in df.values]
    # a = tuple(a)
    return a
def read_csv_pd_input3(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'pkname', 'heightm']]
    a = df.set_index('peakid')[['pkname', 'heightm']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    return a
def run(argv=None):
    import apache_beam as beam
    import io
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/expeditions-2010s.csv')
    parser.add_argument(
        '--input3',
        dest='input3',
        required=False,
        help='Input_p3 file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/peaks.csv')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    input_p1 = (
        p
        | 'Read From GCS input1' >> beam.Create([known_args.input])
        | 'Pair each employee with key p1' >> beam.FlatMap(read_csv_pd_input1)
        # | beam.Map(print)
    )
    input_p3 = (
        p
        | 'Read From GCS input3' >> beam.Create([known_args.input3])
        | 'Pair each employee with key p3' >> beam.FlatMap(read_csv_pd_input3)
    )

    # CoGroupByKey: relational join of two or more key/value PCollections.
    # It also accepts a dictionary of PCollections.
    output = (
        {'input_p1': input_p1, 'input_p3': input_p3}
        | 'Join' >> beam.CoGroupByKey()
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table='project_name:dataset.expeditions',
            schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
            method='FILE_LOADS',
            custom_gcs_temp_location='gs://bucket-name/input/temp',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )

    p.run().wait_until_finish()
    # runner = DataflowRunner()
    # runner.run_pipeline(p, options=options)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
This part of the pipeline is wrong:
| 'Join' >> beam.CoGroupByKey()
| 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(...
The output of CoGroupByKey will have the format key, {'input_p1': [list_of_p1_elems_with_key], 'input_p3': [list_of_p3_elems_with_key]}. You need to process that output to map it to the schema expected by the BigQuery sink. Because the schema of the data does not match the schema specified in the BigQuery sink, the ingestion of data fails.
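For illustration (with made-up values): since read_csv_pd_input1 emits (peakid, (bcdate, smtdate)) pairs and read_csv_pd_input3 emits (peakid, (pkname, heightm)) pairs, a single grouped element would look roughly like:

('AMAD', {'input_p1': [('2010-03-30', '2010-04-12')], 'input_p3': [('Ama Dablam', '6814')]})

That nested structure is what currently reaches WriteToBigQuery, and it is nothing like the flat {'peakid': ..., 'bcdate': ..., ...} row dictionaries the sink expects, hence the 'invalid' load error.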
The Beam programming guide has an example of how to process the output of CoGroupByKey, and the transform catalog has an example too.
I am not sure exactly how the columns of p1 and p3 are used to populate the BigQuery table. But other than that, after the beam.CoGroupByKey you could apply a beam.FlatMap (rather than beam.Map, since the function yields multiple rows per key) with a function similar to this one:
def process_group(kv):
    key, values = kv
    input_p1_list = values['input_p1']
    input_p3_list = values['input_p3']
    for p1 in input_p1_list:
        for p3 in input_p3_list:
            # The dict keys must match the field names in the sink schema
            # ('peakid', 'bcdate', ..., as declared in WriteToBigQuery).
            row_for_bq = {'peakid': key, 'bcdate': p1['something'], 'heightm': p3['something']}
            yield row_for_bq
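Wired into the question's pipeline, the join stage might then look like the sketch below (a minimal sketch reusing the question's own sink settings; process_group is the function above, and only the FlatMap step is new):

output = (
    {'input_p1': input_p1, 'input_p3': input_p3}
    | 'Join' >> beam.CoGroupByKey()
    # Flatten each (key, {'input_p1': [...], 'input_p3': [...]}) group
    # into one dict per output row, with keys matching the BigQuery schema.
    | 'Process Join Output' >> beam.FlatMap(process_group)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name:dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://bucket-name/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
)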