
Python/Pyspark iteration code (for AWS Glue ETL job)

I am using AWS Glue, and you cannot read/write multiple dynamic frames without using an iteration. I wrote the code below but am struggling with two things:

  1. Is "tableName", i.e. the filtered list of tables, correct? (All the tables I want to iterate over start with client_historical_*.)
  2. I am stuck on how to dynamically populate the Redshift table name using the mapping below.

Redshift mappings:

client_historical_ks --> table_01_a
client_historical_kg --> table_01_b
client_historical_kt --> table_01_c
client_historical_kf --> table_01_d

Code:

client = boto3.client('glue',region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']

for table in tableList:
    start_prefix = 'client_historical_'
    tableName = list(filter(lambda x: x.startswith(start_prefix), table['Name']))
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = tableName, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": "nameoftablehere", "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
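A note on point 1: in the loop above, `table['Name']` is a single string, so `filter` iterates over its characters rather than over a list of table names. A quick stand-alone check shows the result:

```python
# `filter` over a string visits one character at a time, so the
# lambda tests each character against the prefix, not each table name.
name = 'client_historical_ks'
chars = list(filter(lambda x: x.startswith('client_historical_'), name))
print(chars)  # no single character matches a multi-character prefix, so this is []
```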

You can create a mapping dictionary and then execute your code. You can also filter the tables outside of the loop, and then loop over only the required tables.

mapping = {'client_historical_ks': 'table_01_a',
'client_historical_kg': 'table_01_b',
'client_historical_kt': 'table_01_c',
'client_historical_kf': 'table_01_d'}

client = boto3.client('glue',region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']
start_prefix = 'client_historical_'
tableNames = [t['Name'] for t in tableList if t['Name'].startswith(start_prefix)]

for table in tableNames:
    target_table = mapping.get(table)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = table, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": target_table, "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
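Two caveats on the snippet above, assuming the standard boto3 Glue client: `get_tables` returns paginated results, so a paginator is needed to see every table in a large database, and `mapping.get` will hand `None` to `dbtable` if a table has no mapping entry. A sketch of both fixes (the helper names are my own):

```python
def list_catalog_tables(client, database_name):
    """Collect every table name in a Glue database, following pagination.
    `client` is a boto3 Glue client, e.g. boto3.client('glue')."""
    names = []
    paginator = client.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database_name):
        names.extend(t['Name'] for t in page['TableList'])
    return names

def resolve_targets(table_names, mapping, start_prefix='client_historical_'):
    """Pair each prefixed source table with its Redshift target,
    skipping any table that has no entry in the mapping."""
    return [(name, mapping[name])
            for name in table_names
            if name.startswith(start_prefix) and name in mapping]
```

With these helpers the write loop becomes `for source, target in resolve_targets(list_catalog_tables(client, 'incomingdata'), mapping): ...`, and only tables with a known Redshift target are processed.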

