Google Dataflow job (in Python): request for assistance with a join over data sets: fixing a type error
I am new to Apache Beam on DataflowRunner. I am trying to load the base table and then perform CDC against a delta table (after loading the delta file into the delta table).
I am getting the below error message:
File "beamETL4.py", line 81, in process_id: TypeError: tuple indices must be integers, not str [while running 'FlatMap(process_id)']
Any pointers will help. Sorry, I am still learning.
Details of the code:
About the data:
The files contain 3 columns.
Column names: id, name, salary.
Data types: int, string, int.
Below is my code module:
"""
Author :
Vidya
Modification History :
17-Dec-2019 Vidya Initial Draft
"""
from __future__ import absolute_import
# Import Libraries
import argparse
import logging
import warnings
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from typing import List, Any
warnings.filterwarnings('ignore')
# Define custom class DataIngestion
class DataIngestion():
"""A helper class the load the file to the big query table."""
def __init__(self):
pass
def parse_method(self, input_string):
# Strip out carriage return, newline and quote characters.
values = re.split(",",
re.sub('\r\n', '', re.sub(u'"', '', input_string)))
row = dict(
zip(('id', 'name', 'salary'), values)
)
return row
class DataLakeComparison:
"""helper class """
def __init__(self):
pass
def base_query():
base_query = """
SELECT
id,
name,
salary
FROM CDC.base
"""
return base_query
def delta_query():
delta_query = """
SELECT
id,
name,
salary
FROM CDC.delta
"""
return delta_query
def process_id(self, id, data):
"""This function performs the join of the two datasets."""
result = list(data['delta']) # type: List[Any]
if not data['base']:
logging.info('id is missing in base')
return
if not data['delta']:
logging.info(' id is missing in delta')
return
base = {}
try:
base = data['base'][0]
except KeyError as err:
traceback.print_exc()
logging.error("id Not Found error: %s", err)
for delta in result:
delta.update(base)
return result
def run(argv=None):
"""The main function which creates the pipeline and runs it."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--input',
dest='input',
required=False,
help='Input file to read. This can be a local file or '
'a file in a Google Storage Bucket.',
default='gs://input-cobalt/delta1.csv'
)
parser.add_argument(
'--output',
dest='output',
required=False,
help='Output BQ table to load the delta file ',
default='CDC.delta'
)
parser.add_argument(
'--output2',
dest='output',
required=False,
help='Output BQ table to load the base table',
default='CDC.base'
)
# Parse arguments from command line.
known_args, pipeline_args = parser.parse_known_args(argv)
data_ingestion = DataIngestion()
# Instantiate pipeline
options = PipelineOptions(pipeline_args)
p = beam.Pipeline(options=options)
(p
| 'Read from a File' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
| 'String To BigQuery Row' >>
beam.Map(lambda s: data_ingestion.parse_method(s))
| 'Write to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
known_args.output,
schema='id:INTEGER,name:STRING,salary:INTEGER',
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
)
datalakecomparison = DataLakeComparison()
base_data = datalakecomparison.base_query()
delta_data = datalakecomparison.delta_query()
base_data = (
p
| 'Read Delta from BigQuery ' >> beam.io.Read(
beam.io.BigQuerySource(query=base_data, use_standard_sql=True))
|
'Map id in base' >> beam.Map(
lambda row: (
row['id'], row
)))
delta_data = (
p
| 'Read Delta from BigQuery ' >> beam.io.Read(
beam.io.BigQuerySource(query=delta_data, use_standard_sql=True))
|
'Map id in base' >> beam.Map(
lambda row: (
row['id'], row
)))
result = {'base': base_data, 'delta': delta_data} | beam.CoGroupByKey()
joined = result | beam.FlatMap(datalakecomparison.process_id(result))
joined | 'Write Data to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
known_args.output2,
schema='id:INTEGER,name:STRING,salary:INTEGER',
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
p.run().wait_until_finish()
# main function
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
I believe there are two issues:
1. You're not allowed to mutate your inputs within a DoFn, but the line delta.update(base) mutates the input argument data. This could be causing an unintended side effect which later manifests in the error you're getting. Please create a shallow copy of each row before updating it.
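A minimal sketch of a non-mutating version, written as a standalone function outside of Beam (the base/delta dict shape is assumed from the CoGroupByKey output, and returning an empty list instead of None is a small additional tidy-up):

```python
import logging

def process_id(id, data):
    """Join base and delta rows for one id without mutating the input."""
    if not data['base']:
        logging.info('id %s is missing in base', id)
        return []
    if not data['delta']:
        logging.info('id %s is missing in delta', id)
        return []
    base = data['base'][0]
    # dict(delta, **base) builds a fresh dict for each output row,
    # so the grouped input rows are left untouched.
    return [dict(delta, **base) for delta in data['delta']]
```
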
2. Did you mean to use beam.FlatMapTuple(datalakecomparison.process_id) instead of beam.FlatMap(datalakecomparison.process_id(result))? The CoGroupByKey will produce records like (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}). For the above example, process_id will be invoked with id=7 and data={'base': [{'id': 7, 'name': 'name1', 'salary': 1}], 'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]}. See FlatMapTuple for more details.
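A plain-Python illustration of the unpacking difference (process_id here is a simplified stand-in for the method in the question, and the element shape is the CoGroupByKey output described above):

```python
# One element emitted by CoGroupByKey: a (key, grouped-dict) tuple.
element = (7, {'base': [{'id': 7, 'name': 'name1', 'salary': 1}],
               'delta': [{'id': 7, 'name': 'name1', 'salary': 2}]})

def process_id(id, data):
    base = data['base'][0]
    return [dict(row, **base) for row in data['delta']]

# beam.FlatMap(process_id) would pass the whole tuple as the first
# argument; beam.FlatMapTuple(process_id) unpacks it into (id, data),
# which is equivalent to:
rows = process_id(*element)
```
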