Left join with CoGroupByKey sink to BigQuery using Dataflow

I would like to join files (expeditions- 2010s.csv and peaks.csv) on the join key "peakid" with CoGroupByKey. However, when I sink the result to BigQuery I get this error:

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_88_215864ba592a2e01f0c4e2157cc60c47_86e3562707f348c29b2a030cb6ed7ded failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs'].

Please review the code below:

import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def read_csv_pd_input1(readable_file):
    import json
    import pandas as pd   
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'bcdate', 'smtdate']]
    
    a = df.set_index('peakid')[['bcdate', 'smtdate']].apply(tuple,1).to_dict()
    a = tuple(a.items())
    
    # result: only column name   
    # a = df.agg(lambda x: (x.values)).apply(tuple)

    # result: only value but not as expected    
    # a = [tuple(x) for x in df.values]
    # a = tuple(a)
    return a

def read_csv_pd_input3(readable_file):
    import json
    import pandas as pd   
    import csv
    import io
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'pkname', 'heightm']] 
    
    a = df.set_index('peakid')[['pkname', 'heightm']].apply(tuple,1).to_dict()
    a = tuple(a.items())
    
    return a


def run(argv=None):
    import apache_beam as beam
    import io

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
        'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/expeditions- 2010s.csv')
    
    parser.add_argument(
        '--input3',
        dest='input3',
        required=False,
        help='Input_p3 file to read. This can be a local file or '
        'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/peaks.csv')
     
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)

    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    input_p1 = (
        p
         | 'Read From GCS input1' >> beam.Create([known_args.input])
         | 'Pair each employee with key p1' >> beam.FlatMap(read_csv_pd_input1)
         # | beam.Map(print)
        
    )
    input_p3 = (
        p
         | 'Read From GCS input3' >> beam.Create([known_args.input3])
         | 'Pair each employee with key p3' >> beam.FlatMap(read_csv_pd_input3)
    )
    # CoGroupByKey: relational join of two or more key/value PCollections. It also accepts a dict of named PCollections
    output = (
        {'input_p1': input_p1, 'input_p3': input_p3} 
        | 'Join' >> beam.CoGroupByKey()
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
           table='project_name:dataset.expeditions',
           schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
           method='FILE_LOADS',
           custom_gcs_temp_location='gs://bucket-name/input/temp',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)    
    )
    p.run().wait_until_finish()
    # runner = DataflowRunner()
    # runner.run_pipeline(p, options=options)

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

This part of the pipeline is wrong:

    | 'Join' >> beam.CoGroupByKey()
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(...

The output of CoGroupByKey will have the format key, {'input_p1': [list_of_p1_elems_with_key], 'input_p3': [list_of_p3_elems_with_key]}. You need to process that output to map it to the schema expected by the BigQuery sink.

Because the schema of the data does not match the schema specified in the BigQuery sink, the ingestion of data fails.
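
For illustration, given how the read functions above emit (peakid, (values...)) tuples, a grouped element looks roughly like the following (the values are made up), whereas the BigQuery sink expects one flat dictionary per row matching the declared schema:

# Hypothetical element produced by 'Join' >> beam.CoGroupByKey()
# (values are illustrative only):
('EVER', {'input_p1': [('2010-04-05', '2010-05-23')],   # (bcdate, smtdate) tuples
          'input_p3': [('Everest', '8848')]})           # (pkname, heightm) tuples

# Shape WriteToBigQuery expects for the schema
# 'peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER':
{'peakid': 'EVER', 'bcdate': '2010-04-05', 'pkname': 'Everest', 'heightm': 8848}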

The Beam programming guide has an example of how to process the output of CoGroupByKey, and the transform catalog has an example too.

I am not sure exactly how the columns of p1 and p3 are used to populate the BigQuery table. But other than that, after the beam.CoGroupByKey you could apply a beam.FlatMap (rather than beam.Map, since the function yields one row per matched pair) with a function similar to this one:

def process_group(kv):
  key, values = kv
  input_p1_list = values['input_p1']  # elements from input_p1 that share this key
  input_p3_list = values['input_p3']  # elements from input_p3 that share this key
  for p1 in input_p1_list:
    for p3 in input_p3_list:
      # field names must match the BigQuery schema (peakid, bcdate, pkname, heightm)
      row_for_bq = {'peakid': key, 'bcdate': p1['something'], 'heightm': p3['something']}
      yield row_for_bq
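
With process_group in place, the join and write steps of the pipeline could be wired up roughly like this (a sketch reusing the table, schema and temp location from the question's code; 'Build BQ rows' is just a step label):

output = (
    {'input_p1': input_p1, 'input_p3': input_p3}
    | 'Join' >> beam.CoGroupByKey()
    | 'Build BQ rows' >> beam.FlatMap(process_group)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name:dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://bucket-name/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))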
