Can not sink to BigQuery using Dataflow Apache Beam
I have 2 CSV files, expeditions- 2010s.csv and peaks.csv, with the join key 'peak_id'. I join them in a Dataflow notebook using Apache Beam. Here is my code:
def read_csv_file(readable_file):
    import apache_beam as beam
    import csv
    import io
    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    # Read it as csv, you can also use csv.reader
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    for row in csv_dict:
        yield row
def run(argv=None):
    import apache_beam as beam
    import argparse
    from apache_beam.options.pipeline_options import PipelineOptions

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        # This example file contains a total of only 10 lines.
        # Useful for developing on a small set of data.
        default='gs://bucket/folder/peaks.csv')
    parser.add_argument(
        '--input1',
        dest='input1',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        # This example file contains a total of only 10 lines.
        # Useful for developing on a small set of data.
        default='gs://bucket/folder/expeditions- 2010s.csv')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    input_p1 = (
        p
        | 'Read From GCS input1' >> beam.Create([known_args.input1])
        | 'Parse csv file p1' >> beam.FlatMap(read_csv_file)
        | 'Tuple p1' >> beam.Map(lambda e: (e["peakid"], {'peakid': e["peakid"], 'bcdate': e["bcdate"], 'smtdate': e["smtdate"]}))
    )
    input_p2 = (
        p
        | 'Read From GCS input2' >> beam.Create([known_args.input])
        | 'Parse csv file p2' >> beam.FlatMap(read_csv_file)
        | 'Tuple p2' >> beam.Map(lambda e: (e["peakid"], {'peakid': e["peakid"], 'pkname': e["pkname"], 'heightm': e["heightm"]}))
    )

    # CoGroupByKey: relational join of 2 or more keyed PCollections.
    # It also accepts a dictionary of key/value PCollections.
    output = (
        (input_p1, input_p2)
        | 'Join' >> beam.CoGroupByKey()
        | 'Final Dict' >> beam.Map(lambda el: to_final_dict(el[1]))
        # | beam.Map(print)
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table='project:dataset.expeditions',
            method='FILE_LOADS',
            custom_gcs_temp_location='gs://bucket/folder/temp',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )
    p.run().wait_until_finish()
def to_final_dict(list_tuple_of_tuple):
    result = {}
    for list_tuple in list_tuple_of_tuple:
        for el in list_tuple:
            result.update(el)
    return result

# runner = DataflowRunner()
# runner.run_pipeline(p, options=options)

if __name__ == '__main__':
    import logging
    logging.getLogger().setLevel(logging.INFO)
    run()
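To make the join step concrete, here is a small sketch of what `to_final_dict` receives from `CoGroupByKey` and what it produces. The key `'AMAD'` and the field values are made-up sample data, not taken from the actual CSV files:

```python
# CoGroupByKey emits (key, (records_from_p1, records_from_p2));
# to_final_dict merges every dict from both sides into one flat record.
def to_final_dict(list_tuple_of_tuple):
    result = {}
    for list_tuple in list_tuple_of_tuple:
        for el in list_tuple:
            result.update(el)
    return result

# Hypothetical joined element for one key (values invented for illustration):
grouped = (
    [{'peakid': 'AMAD', 'bcdate': '2013-12-25', 'smtdate': '2014-01-02'}],  # from input_p1
    [{'peakid': 'AMAD', 'pkname': 'Ama Dablam', 'heightm': '6814'}],        # from input_p2
)
merged = to_final_dict(grouped)
# merged now holds the expedition and peak fields in a single dict
```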
I get the expected result before writing to BigQuery:
But it fails to write to BigQuery with the following error: RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_602_215864ba592a2e01f0c4e2157cc60c47_51de5de53b58409da70f699c833c4db5 failed. Error Result: <ErrorProto location: 'gs://bucket/folder/temp/bq_load/4bbfc44d750c4af5ab376b2e3c3dedbd/project.dataset.expeditions/25905e46-db76-49f0-9b98-7d77131e3e0d' message: 'Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket/folder/temp/bq_load/4bbfc44d750c4af5ab376b2e3c3dedbd/project.dataset.expeditions/25905e46-db76-49f0-9b98-7d77131e3e0d' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDest...']
I think the date format is not correct. Use the following format for your date fields: YYYY-MM-DD
=> 2013-12-25
Usually that solves this problem.
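A minimal sketch of such a normalization step, assuming the CSV dates arrive as DD/MM/YYYY (the `strptime` pattern is an assumption — adjust it to the actual source format; empty strings are mapped to None so the load job does not reject them):

```python
from datetime import datetime

def normalize_dates(record, fields=('bcdate', 'smtdate')):
    """Rewrite date fields to the YYYY-MM-DD format BigQuery's DATE type expects."""
    out = dict(record)
    for f in fields:
        value = out.get(f)
        if not value:
            out[f] = None  # an empty string is not a valid DATE value
            continue
        # Assumed input format; change '%d/%m/%Y' to match the real CSV.
        out[f] = datetime.strptime(value, '%d/%m/%Y').strftime('%Y-%m-%d')
    return out
```

In the pipeline this would sit between the join and the sink, e.g. `| 'Normalize dates' >> beam.Map(normalize_dates)` just before the 'Write To BigQuery' step.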