How to connect to Redshift from AWS Glue (PySpark)?
I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement), but I cannot seem to connect.
The following code just times out:
df = spark.read \
.format('jdbc') \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
.option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
.option("tempdir", "s3n://test") \
.option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
.load()
What could be the reason?
I checked the URL, user, and password, and also tried different IAM roles, but it just hangs every time.
I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there), and it also hangs/times out:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
.option("dbtable", "public.test") \
.option("user", "myuser") \
.option("password", "mypass") \
.load()
Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine. There is just something wrong with the connection to Redshift, but I am not sure what.
You seem to be on the correct path. I connect to and query Redshift from a Glue PySpark job the same way, except for a minor change: using
.format("com.databricks.spark.redshift")
I have also successfully used
.option("forward_spark_s3_credentials", "true")
instead of
.option("iam_role", "my_iam_role")