Python/PySpark iteration code (for AWS Glue ETL job)
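I am using AWS Glue, where you cannot read or write multiple dynamic frames without iterating. I wrote the code below, but I am struggling with two things:
Redshift mapping: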
client_historical_ks --> table_01_a
client_historical_kg --> table_01_b
client_historical_kt --> table_01_c
client_historical_kf --> table_01_d
Code:
client = boto3.client('glue',region_name='us-east-1')
databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']
for table in tableList:
    start_prefix = 'client_historical_'
    tableName = list(filter(lambda x: x.startswith(start_prefix), table['Name']))
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = tableName, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": "nameoftablehere", "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
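You can create a mapping dictionary and use it in your code. You can also filter the tables outside the loop and then iterate only over the tables you need.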
import boto3

# Glue catalog table name -> Redshift target table
mapping = {'client_historical_ks': 'table_01_a',
           'client_historical_kg': 'table_01_b',
           'client_historical_kt': 'table_01_c',
           'client_historical_kf': 'table_01_d'}
client = boto3.client('glue',region_name='us-east-1')
databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']
start_prefix = 'client_historical_'
# Keep only the names of catalog tables that start with the prefix
tableNames = [t['Name'] for t in tableList if t['Name'].startswith(start_prefix)]
for table in tableNames:
    target_table = mapping.get(table)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = table, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": target_table, "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
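
One thing to watch with the loop above: `mapping.get(table)` returns `None` for a catalog table that matches the prefix but has no entry in the dictionary, and that `None` would then be passed as `dbtable`. A minimal sketch of the filter-then-map step as a plain function is shown below, so it can be checked without a Glue job. The helper name `plan_copies` and the sample list (including `client_historical_zz` and `other_table`) are hypothetical and only mimic the `TableList` part of a `get_tables` response.

```python
def plan_copies(table_list, mapping, prefix="client_historical_"):
    """Return (source, target) pairs for tables that match the prefix and have a mapping."""
    pairs = []
    for entry in table_list:
        name = entry["Name"]
        if not name.startswith(prefix):
            continue  # not one of the tables we want to copy
        target = mapping.get(name)
        if target is None:
            continue  # no Redshift target known, so skip instead of writing to dbtable=None
        pairs.append((name, target))
    return pairs


mapping = {
    "client_historical_ks": "table_01_a",
    "client_historical_kg": "table_01_b",
    "client_historical_kt": "table_01_c",
    "client_historical_kf": "table_01_d",
}

# Hypothetical sample shaped like tables['TableList']
sample_table_list = [
    {"Name": "client_historical_ks"},
    {"Name": "client_historical_kg"},
    {"Name": "client_historical_zz"},  # matches the prefix but has no mapping
    {"Name": "other_table"},           # does not match the prefix
]

for source, target in plan_copies(sample_table_list, mapping):
    print(source, "->", target)
```

In the Glue job, the body of this final loop would be the `create_dynamic_frame.from_catalog` and `write_dynamic_frame.from_jdbc_conf` calls from the answer, with `source` as `table_name` and `target` as `dbtable`.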