Google Dataflow job (in Python): request for assistance with a join over data sets: fixing a type error
I am new to Apache Beam on DataflowRunner. I am trying to load the base table and then perform CDC against a delta table (after loading the delta file into the delta table).
I am getting the below error message:
File "beamETL4.py", line 81, in process_id: TypeError: tuple indices must be integers, not str [while running 'FlatMap(process_id)']
Any pointers will help. Sorry, I am still learning.
Details of the code:
About the data:
The files contain 3 columns.
Column names: id, name, salary.
Data types: int, string, int.
Below is my code module:
"""
Author :
Vidya
Modification History :
17-Dec-2019 Vidya Initial Draft
"""
from __future__ import absolute_import
# Import Libraries
import argparse
import logging
import warnings
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from typing import List, Any
warnings.filterwarnings('ignore')
# Define custom class DataIngestion
class DataIngestion():
"""A helper class the load the file to the big query table."""
def __init__(self):
pass
def parse_method(self, input_string):
# Strip out carriage return, newline and quote characters.
values = re.split(",",
re.sub('\r\n', '', re.sub(u'"', '', input_string)))
row = dict(
zip(('id', 'name', 'salary'), values)
)
return row
class DataLakeComparison:
"""helper class """
def __init__(self):
pass
def base_query():
base_query = """
SELECT
id,
name,
salary
FROM CDC.base
"""
return base_query
def delta_query():
delta_query = """
SELECT
id,
name,
salary
FROM CDC.delta
"""
return delta_query
def process_id(self, id, data):
"""This function performs the join of the two datasets."""
result = list(data['delta']) # type: List[Any]
if not data['base']:
logging.info('id is missing in base')
return
if not data['delta']:
logging.info(' id is missing in delta')
return
base = {}
try:
base = data['base'][0]
except KeyError as err:
traceback.print_exc()
logging.error("id Not Found error: %s", err)
for delta in result:
delta.update(base)
return result
def run(argv=None):
"""The main function which creates the pipeline and runs it."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--input',
dest='input',
required=False,
help='Input file to read. This can be a local file or '
'a file in a Google Storage Bucket.',
default='gs://input-cobalt/delta1.csv'
)
parser.add_argument(
'--output',
dest='output',
required=False,
help='Output BQ table to load the delta file ',
default='CDC.delta'
)
parser.add_argument(
'--output2',
dest='output',
required=False,
help='Output BQ table to load the base table',
default='CDC.base'
)
# Parse arguments from command line.
known_args, pipeline_args = parser.parse_known_args(argv)
data_ingestion = DataIngestion()
# Instantiate pipeline
options = PipelineOptions(pipeline_args)
p = beam.Pipeline(options=options)
(p
| 'Read from a File' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
| 'String To BigQuery Row' >>
beam.Map(lambda s: data_ingestion.parse_method(s))
| 'Write to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
known_args.output,
schema='id:INTEGER,name:STRING,salary:INTEGER',
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
)
datalakecomparison = DataLakeComparison()
base_data = datalakecomparison.base_query()
delta_data = datalakecomparison.delta_query()
base_data = (
p
| 'Read Delta from BigQuery ' >> beam.io.Read(
beam.io.BigQuerySource(query=base_data, use_standard_sql=True))
|
'Map id in base' >> beam.Map(
lambda row: (
row['id'], row
)))
delta_data = (
p
| 'Read Delta from BigQuery ' >> beam.io.Read(
beam.io.BigQuerySource(query=delta_data, use_standard_sql=True))
|
'Map id in base' >> beam.Map(
lambda row: (
row['id'], row
)))
result = {'base': base_data, 'delta': delta_data} | beam.CoGroupByKey()
joined = result | beam.FlatMap(datalakecomparison.process_id(result))
joined | 'Write Data to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
known_args.output2,
schema='id:INTEGER,name:STRING,salary:INTEGER',
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
p.run().wait_until_finish()
# main function
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
I believe there are two issues:
1. You're not allowed to mutate your inputs within a DoFn, but the line delta.update(base) mutates the input argument data. This could be causing an unintended side effect which later manifests in the error you're getting. Please create a shallow copy of each row before updating it.
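A minimal sketch of a non-mutating version, written as a standalone function outside of Beam (the base/delta dict shape is assumed from the CoGroupByKey output, and returning an empty list instead of None is a small additional tidy-up):

```python
import logging

def process_id(id, data):
    """Join base and delta rows for one id without mutating the input."""
    if not data['base']:
        logging.info('id %s is missing in base', id)
        return []
    if not data['delta']:
        logging.info('id %s is missing in delta', id)
        return []
    base = data['base'][0]
    # dict(delta, **base) builds a fresh dict for each output row,
    # so the grouped input rows are left untouched.
    return [dict(delta, **base) for delta in data['delta']]
```
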
2. Did you mean to use beam.FlatMapTuple(datalakecomparison.process_id) instead of beam.FlatMap(datalakecomparison.process_id(result))? The CoGroupByKey will produce records like (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}). For the above example, process_id will be invoked with id=7 and data={'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}. See FlatMapTuple for more details.
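A plain-Python illustration of the unpacking difference (process_id here is a simplified stand-in for the method in the question, and the element shape is the CoGroupByKey output described above):

```python
# One element emitted by CoGroupByKey: a (key, grouped-dict) tuple.
element = (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}],
               'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]})

def process_id(id, data):
    base = data['base'][0]
    return [dict(row, **base) for row in data['delta']]

# beam.FlatMap(process_id) would pass the whole tuple as the first
# argument; beam.FlatMapTuple(process_id) unpacks it into (id, data),
# which is equivalent to:
rows = process_id(*element)
```
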