
AWS Glue Data moving from S3 to Redshift

I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest have data type issues, because Redshift does not accept some of the data types. I resolved the issue with a script that moves the tables one by one:

table1 = glueContext.create_dynamic_frame.from_catalog(
    database="db1_g", table_name="table1"
)
table1 = table1.resolveChoice(
    specs=[
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ]
)
table1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=table1,
    catalog_connection="redshift",
    connection_options={"dbtable": "schema1.table1", "database": "db1"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="table1",
)

The same script is used for all the other tables that have the data type issue. But since I would like to automate this, I wrote a looping script that iterates through all the tables and writes them to Redshift. I have 2 issues with this script:

  1. Unable to move the tables to their respective schemas in Redshift.
  2. Unable to add an if condition in the loop script for those tables which need a data type change.
import boto3

client = boto3.client("glue", region_name="us-east-1")

databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )

    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": tableName,
            "database": "schema1.db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )
job.commit()

Mentioning the Redshift schema name along with the table name, like schema1.tableName, throws an error which says schema1 is not defined.

Can anybody help with changing the data type for all the tables that require it, inside the looping script itself?

So the first problem is fixed rather easily. The schema belongs in the dbtable attribute, not in database, like this:

connection_options={
    "dbtable": f"schema1.{tableName}",
    "database": "db1",
}
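
Dropped into the loop from your question, the corrected write call would look like this (a sketch reusing the names from your script):

    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            # The schema goes into dbtable; database stays the plain database name.
            "dbtable": f"schema1.{tableName}",
            "database": "db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )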

Your second problem is that you want to call resolveChoice inside of the for loop, correct? What kind of error occurs there? Why doesn't it work?
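
If the intent is simply to apply the casts only to the tables known to have type issues, one possible approach is to keep a mapping from table name to the resolveChoice specs that table needs, and check it inside the loop. This is a minimal sketch, assuming it runs inside the same Glue job context as your script; cast_specs is a hypothetical name, seeded here with the specs from your single-table script:

    # Hypothetical mapping: table name -> resolveChoice specs that table needs.
    cast_specs = {
        "table1": [
            ("column1", "cast:char"),
            ("column2", "cast:varchar"),
            ("column3", "cast:varchar"),
        ],
        # ... add entries for the other tables with type issues
    }

    for table in tableList:
        tableName = table["Name"]
        frame = glueContext.create_dynamic_frame.from_catalog(
            database="db1_g", table_name=tableName, transformation_ctx="datasource0"
        )
        # Only cast columns for tables listed in the mapping.
        if tableName in cast_specs:
            frame = frame.resolveChoice(specs=cast_specs[tableName])
        glueContext.write_dynamic_frame.from_jdbc_conf(
            frame=frame,
            catalog_connection="redshift",
            connection_options={"dbtable": f"schema1.{tableName}", "database": "db1"},
            redshift_tmp_dir=args["TempDir"],
            transformation_ctx="datasink4",
        )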
