
Google Dataflow job (in Python): request for assistance with a join over data sets: fixing a type error

I am new to Apache Beam on the DataflowRunner. I am trying to process a base table and then perform CDC against it with a delta table (after loading the delta file into the delta table).

I am getting the following error message:

File "beamETL4.py", line 81, in process_id: TypeError: tuple indices must be integers, not str [while running 'FlatMap(process_id)'] 

Any pointers would be helpful. Apologies, I am still learning.

Code details:

  • The code performs input validation.
  • It then reads the input file and builds the pipeline.
  • The pipeline loads the file into the delta table in BigQuery.
  • It then reads the base table and the delta table, and calls a process function to perform the update.

About the data:

The file contains 3 columns.

Column names: id, name, salary.

Data types: integer, string, integer.
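
For illustration, here is a made-up sample line and the row that parse_method (in the code below) builds from it; note that the parsed values are still strings at this point:

ingestion = DataIngestion()
row = ingestion.parse_method('7,name1,1000')
# row == {'id': '7', 'name': 'name1', 'salary': '1000'}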

Below is my code module:

"""
Author :
Vidya 


Modification History :
17-Dec-2019     Vidya       Initial Draft

"""

from __future__ import absolute_import

# Import Libraries
import argparse
import logging
import re          # used by DataIngestion.parse_method
import traceback   # used by DataLakeComparison.process_id
import warnings
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from typing import List, Any

warnings.filterwarnings('ignore')


# Define custom class DataIngestion

class DataIngestion():
    """A helper class the load the file to the big query table."""

    def __init__(self):
        pass

    def parse_method(self, input_string):
        # Strip out carriage return, newline and quote characters.
        values = re.split(",",
                          re.sub('\r\n', '', re.sub(u'"', '', input_string)))
        row = dict(
            zip(('id', 'name', 'salary'), values)
        )
        return row


class DataLakeComparison:
    """helper class """

    def __init__(self):
        pass

    def base_query(self):
        base_query = """
        SELECT 
        id, 
        name,
        salary
        FROM CDC.base
        """
        return base_query

    def delta_query(self):
        delta_query = """
        SELECT 
        id, 
        name,
        salary
        FROM CDC.delta 
        """
        return delta_query

    def process_id(self, id, data):
        """This function performs the join of the two datasets."""
        result = list(data['delta'])  # type: List[Any]
        if not data['base']:
            logging.info('id is missing in base')
            return
        if not data['delta']:
            logging.info(' id is missing in delta')
            return

        base = {}
        try:
            base = data['base'][0]
        except KeyError as err:
            traceback.print_exc()
            logging.error("id Not Found error: %s", err)

        for delta in result:
            delta.update(base)

        return result


def run(argv=None):
    """The main function which creates the pipeline and runs it."""
    parser = argparse.ArgumentParser()

    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='gs://input-cobalt/delta1.csv'
    )
    parser.add_argument(
        '--output',
        dest='output',
        required=False,
        help='Output BQ table to load the delta file ',
        default='CDC.delta'
    )

    parser.add_argument(
        '--output2',
        dest='output2',
        required=False,
        help='Output BQ table to load the base table',
        default='CDC.base'
    )
    # Parse arguments from command line.
    known_args, pipeline_args = parser.parse_known_args(argv)

    data_ingestion = DataIngestion()

    # Instantiate pipeline
    options = PipelineOptions(pipeline_args)

    p = beam.Pipeline(options=options)

    (p
     | 'Read from a File' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
     | 'String To BigQuery Row' >>
     beam.Map(lambda s: data_ingestion.parse_method(s))
     | 'Write to BigQuery' >> beam.io.Write(
                beam.io.BigQuerySink(
                    known_args.output,
                    schema='id:INTEGER,name:STRING,salary:INTEGER',
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
     )
    datalakecomparison = DataLakeComparison()
    base_data = datalakecomparison.base_query()
    delta_data = datalakecomparison.delta_query()
    # Read the base table and key each row by id.
    base_data = (
        p
        | 'Read Base from BigQuery' >> beam.io.Read(
            beam.io.BigQuerySource(query=base_data, use_standard_sql=True))
        | 'Map id in base' >> beam.Map(lambda row: (row['id'], row)))
    # Read the delta table and key each row by id.
    delta_data = (
        p
        | 'Read Delta from BigQuery' >> beam.io.Read(
            beam.io.BigQuerySource(query=delta_data, use_standard_sql=True))
        | 'Map id in delta' >> beam.Map(lambda row: (row['id'], row)))

    result = {'base': base_data, 'delta': delta_data} | beam.CoGroupByKey()
    joined = result | beam.FlatMap(datalakecomparison.process_id(result))
    joined | 'Write Data to BigQuery' >> beam.io.Write(
        beam.io.BigQuerySink(
            known_args.output2,
            schema='id:INTEGER,name:STRING,salary:INTEGER',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    p.run().wait_until_finish()


# main function

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

I think there are two issues:

  1. You are not allowed to mutate your input inside a DoFn, but delta.update(base) mutates the input argument data. This can cause unexpected side effects, which can later surface as the kind of error you got. Create a shallow copy of the row before updating it (see the first sketch after this list).

  2. You presumably meant beam.FlatMapTuple(datalakecomparison.process_id) rather than beam.FlatMap(datalakecomparison.process_id(result)). The result of CoGroupByKey yields records like (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}). For that record, process_id would be called with id=7 and data={'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}. See FlatMapTuple for more details; both fixes are sketched below.
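
A minimal sketch of the fix for issue 1, assuming you want every delta row emitted with the base values layered on top (which is what delta.update(base) does); only the shallow copy is new, the rest mirrors your process_id:

    def process_id(self, id, data):
        """Joins the base row onto each delta row for one id."""
        if not data['base']:
            logging.info('id is missing in base')
            return []
        if not data['delta']:
            logging.info('id is missing in delta')
            return []
        base = data['base'][0]
        result = []
        for delta in data['delta']:
            merged = dict(delta)  # shallow copy: DoFn inputs must never be mutated
            merged.update(base)
            result.append(merged)
        return result

And for issue 2, a plain-Python illustration (with made-up values) of why indexing the CoGroupByKey record with a string fails, followed by the corrected wiring:

element = (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}],
               'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]})
# element['delta'] would raise: TypeError: tuple indices must be integers, not str
key, data = element  # this unpacking is what FlatMapTuple does for you

result = {'base': base_data, 'delta': delta_data} | beam.CoGroupByKey()
joined = result | beam.FlatMapTuple(datalakecomparison.process_id)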
