How to connect to Redshift from AWS Glue (PySpark)?
I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement), but I cannot seem to connect.
The following code just times out:
df = spark.read \
.format('jdbc') \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
.option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
.option("tempdir", "s3n://test") \
.option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
.load()
What could be the reason?
I checked the URL, user, and password, and also tried different IAM roles, but it just hangs every time.
I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there), and it also hangs/times out:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
.option("dbtable", "public.test") \
.option("user", "myuser") \
.option("password", "mypass") \
.load()
Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine. There is just something wrong with the connection to Redshift, but I am not sure what.
You seem to be on the correct path. I connect to and query Redshift from a Glue PySpark job the same way, except for a minor change: using
.format("com.databricks.spark.redshift")
I have also successfully used
.option("forward_spark_s3_credentials", "true")
instead of
.option("iam_role", "my_iam_role")