
Python/Pyspark iteration code (for AWS Glue ETL job)

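I am using AWS Glue, and you cannot read/write multiple dynamic frames without iterating. I wrote the code below but am struggling with two things:

  1. Whether "tableName", i.e. the filtered list of tables, is correct (all the tables I want to iterate over start with client_historical_*).
  2. I am stuck on how to populate the Redshift table name dynamically using the mapping below.

Redshift mapping: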

client_historical_ks --> table_01_a
client_historical_kg --> table_01_b
client_historical_kt --> table_01_c
client_historical_kf --> table_01_d

Code:

client = boto3.client('glue',region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']

for table in tableList:
    start_prefix = client_historical_
    tableName = list(filter(lambda x: x.startswith(start_prefix), table['Name']))
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = tableName, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": "nameoftablehere", "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")

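You can create a mapping dictionary and look up the Redshift table name from it inside the loop. You can also filter the tables outside the loop and then iterate only over the tables you need.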

import boto3

# Glue catalog table name -> Redshift target table name
mapping = {'client_historical_ks': 'table_01_a',
           'client_historical_kg': 'table_01_b',
           'client_historical_kt': 'table_01_c',
           'client_historical_kf': 'table_01_d'}

client = boto3.client('glue', region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName=databaseName)
tableList = tables['TableList']
start_prefix = 'client_historical_'
# Filter once, outside the loop: each TableList entry is a dict, so test its 'Name'
tableNames = [t['Name'] for t in tableList if t['Name'].startswith(start_prefix)]

# glueContext and args are assumed to come from the usual Glue job setup
for table in tableNames:
    target_table = mapping.get(table)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = databaseName, table_name = table, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": target_table, "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
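One case the loop above does not cover is a catalog table that starts with client_historical_ but has no entry in mapping: mapping.get(table) then returns None, and that None would be passed on as "dbtable". Below is a minimal sketch of one way to guard against this. plan_copies is a hypothetical helper (not part of boto3 or Glue), and the sample list only imitates the shape of tables['TableList'], assuming each entry is a dict with a 'Name' key:

```python
def plan_copies(table_list, mapping, prefix="client_historical_"):
    """Split prefix-matching catalog tables into mapped pairs and unmapped names.

    table_list is assumed to look like tables['TableList'] from
    client.get_tables(): a list of dicts that each carry a 'Name' key.
    plan_copies is a hypothetical helper, not a boto3 or Glue API.
    """
    pairs, unmapped = [], []
    for entry in table_list:
        name = entry["Name"]
        if not name.startswith(prefix):
            continue  # not one of the tables we want to iterate over
        if name in mapping:
            pairs.append((name, mapping[name]))
        else:
            unmapped.append(name)  # matches the prefix but has no Redshift target
    return pairs, unmapped


mapping = {
    "client_historical_ks": "table_01_a",
    "client_historical_kg": "table_01_b",
    "client_historical_kt": "table_01_c",
    "client_historical_kf": "table_01_d",
}

# Stand-in for tables['TableList']; real entries carry many more keys.
sample_table_list = [
    {"Name": "client_historical_ks"},
    {"Name": "client_historical_kg"},
    {"Name": "client_historical_zz"},  # matches the prefix, no mapping entry
    {"Name": "other_table"},
]

pairs, unmapped = plan_copies(sample_table_list, mapping)
print(pairs)
print(unmapped)
```

In the Glue job, the loop would then run over pairs (for source, target in pairs: ...), and unmapped could be logged or treated as an error before any data is written.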
