How to use CoGroupByKey sink to BigQuery in Apache Beam using Dataflow
Left join with CoGroupByKey sink to BigQuery using Dataflow
I want to join two files (expeditions-2010s.csv and peaks.csv) on the join key "peakid" using CoGroupByKey. However, when I sink the result to BigQuery I get this error:

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_88_215864ba592a2e01f0c4e2157cc60c47_86e3562707f348c29b2a030cb6ed7ded failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']

Please see the code below:
def read_csv_pd_input1(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    # Read the expeditions CSV from GCS and keep only the join key and dates.
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'bcdate', 'smtdate']]
    a = df.set_index('peakid')[['bcdate', 'smtdate']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    # result: only column name
    # a = df.agg(lambda x: (x.values)).apply(tuple)
    # result: only value but not as expected
    # a = [tuple(x) for x in df.values]
    # a = tuple(a)
    return a
def read_csv_pd_input3(readable_file):
    import json
    import pandas as pd
    import csv
    import io
    # Read the peaks CSV from GCS and keep only the join key, name, and height.
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'pkname', 'heightm']]
    a = df.set_index('peakid')[['pkname', 'heightm']].apply(tuple, 1).to_dict()
    a = tuple(a.items())
    return a
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/expeditions-2010s.csv')
    parser.add_argument(
        '--input3',
        dest='input3',
        required=False,
        help='Input_p3 file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/peaks.csv')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    input_p1 = (
        p
        | 'Read From GCS input1' >> beam.Create([known_args.input])
        | 'Pair each employee with key p1' >> beam.FlatMap(read_csv_pd_input1)
        # | beam.Map(print)
    )
    input_p3 = (
        p
        | 'Read From GCS input3' >> beam.Create([known_args.input3])
        | 'Pair each employee with key p3' >> beam.FlatMap(read_csv_pd_input3)
    )

    # CoGroupByKey: relational join of two or more key/value PCollections.
    # It also accepts a dictionary of named PCollections.
    output = (
        {'input_p1': input_p1, 'input_p3': input_p3}
        | 'Join' >> beam.CoGroupByKey()
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table='project_name:dataset.expeditions',
            schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
            method='FILE_LOADS',
            custom_gcs_temp_location='gs://bucket-name/input/temp',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )
    p.run().wait_until_finish()
    # runner = DataflowRunner()
    # runner.run_pipeline(p, options=options)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
This part of the pipeline is wrong:

    | 'Join' >> beam.CoGroupByKey()
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(...

The output of CoGroupByKey will have the format key, {'input_p1': [list_of_p1_elems_with_key], 'input_p3': [list_of_p3_elems_with_key]}. You need to apply a mapping step that transforms that output into the schema expected by the BigQuery sink.

The data ingestion fails because the shape of the data does not match the schema specified in the BigQuery sink.

The Beam programming guide has an example of how to process the output of CoGroupByKey, and the transform catalog has an example too.
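Concretely, given how your helpers build their elements, a single grouped element in this pipeline would look roughly like this (the peak ID and values are hypothetical):

# Hypothetical element produced by 'Join' >> beam.CoGroupByKey():
# ('EVER',
#  {'input_p1': [('2010-04-05', '2010-05-23')],   # (bcdate, smtdate) tuples
#   'input_p3': [('Everest', '8849')]})           # (pkname, heightm) tuples

This nested structure is what reaches WriteToBigQuery, which instead expects flat dictionaries whose keys match the schema fields.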
I'm not sure how the columns of p1 and p3 are meant to populate the BigQuery table. But apart from that, after beam.CoGroupByKey you can apply a beam.FlatMap with a function similar to this one (FlatMap rather than Map, because the function yields multiple rows per key):
def process_group(kv):
    key, values = kv
    input_p1_list = values['input_p1']
    input_p3_list = values['input_p3']
    # Emit one flat row per (p1, p3) combination. Replace 'something' with
    # whatever lookup matches the actual structure of your elements; the dict
    # keys must match the field names in the BigQuery schema.
    for p1 in input_p1_list:
        for p3 in input_p3_list:
            row_for_bq = {'peakid': key, 'bcdate': p1['something'], 'heightm': p3['something']}
            yield row_for_bq
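Wired back into your original pipeline, the tail would become something like this sketch (process_group still needs to be adapted to your actual element structure):

output = (
    {'input_p1': input_p1, 'input_p3': input_p3}
    | 'Join' >> beam.CoGroupByKey()
    # Flatten the grouped output into BigQuery-shaped row dictionaries.
    | 'Process join output' >> beam.FlatMap(process_group)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name:dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://bucket-name/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
)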