
Execute Stored Procedure in Glue ETL

How can we execute a SQL statement (such as 'call store_proc();') against Redshift from a PySpark Glue ETL job, using a catalog connection? I want to take the Redshift connection details (host, user, password) from the Glue Catalog Connection.

I understand the 'write_dynamic_frame' option, but I am not sure how to execute only a SQL statement against the Redshift server.

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=data_frame,
    catalog_connection="Redshift_Catalog_Conn",
    connection_options={
        "preactions": "call store_proc();",
        "dbtable": "public.table1",
        "database": "admin",
    },
    redshift_tmp_dir="s3://glue_etl/",
)

As I understand it, you want to call a stored procedure in Redshift from your Glue ETL job. A simple way to execute a stored procedure in Redshift is as follows.

post_query = "begin; CALL sp_procedure1(); end;"
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mydf,
    catalog_connection="redshift_connection",
    connection_options={"dbtable": "my_table", "database": "dev", "postactions": post_query},
    redshift_tmp_dir="s3://tempb/temp/",
    transformation_ctx="datasink",
)
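Since the `preactions`/`postactions` values are plain semicolon-separated SQL strings, several statements can be chained in one of them. A minimal sketch of composing such a string (the `build_sql_actions` helper is illustrative, not part of the Glue API):

```python
def build_sql_actions(statements):
    """Join SQL statements into the semicolon-separated string that
    Glue's Redshift connector expects for preactions/postactions."""
    return "; ".join(s.strip().rstrip(";") for s in statements) + ";"


post_query = build_sql_actions([
    "begin",
    "CALL sp_procedure1()",
    "end",
])
print(post_query)  # begin; CALL sp_procedure1(); end;
```

The resulting string can be passed directly as the `"postactions"` value in `connection_options`, exactly as in the snippet above.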

The other, more elaborate solution is to run SQL queries from application code.

  1. Establish a connection to your Redshift cluster via Glue connections, and create a dynamic frame in Glue with the JDBC option.
 my_conn_options = {
     "url": "jdbc:redshift://host:port/redshift-database-name",
     "dbtable": "redshift-table-name",
     "user": "username",
     "password": "password",
     "redshiftTmpDir": args["TempDir"],
     "aws_iam_role": "arn:aws:iam::account id:role/role-name",
 }
 df = glueContext.create_dynamic_frame_from_options("redshift", my_conn_options)
  2. In order to execute the stored procedure, we will use Spark SQL, so first convert the Glue dynamic frame to a Spark DataFrame.
 spark_df = df.toDF()
 spark_df.createOrReplaceTempView("CUSTOM_TABLE_NAME")
 spark.sql('call store_proc();')
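Alternatively, the connection details the question asks about can be pulled from the catalog connection at runtime with `glueContext.extract_jdbc_conf` and used to call the procedure over a plain database driver connection. A hedged sketch (the dictionary key names can vary by Glue version, e.g. the full JDBC URL may be under `fullUrl`, and a driver such as `psycopg2` must be supplied to the job, e.g. via `--additional-python-modules`; `parse_jdbc_url` is my own helper, not a Glue API):

```python
import re


def parse_jdbc_url(url):
    """Split a Redshift JDBC URL into (host, port, database)."""
    m = re.match(r"jdbc:redshift://([^:/]+):(\d+)/(.+)", url)
    if not m:
        raise ValueError("unexpected JDBC URL: %s" % url)
    host, port, database = m.groups()
    return host, int(port), database


# Inside a Glue job (assumptions: connection name, key names, psycopg2 available):
# conf = glueContext.extract_jdbc_conf("redshift_connection")
# host, port, db = parse_jdbc_url(conf["url"])
# import psycopg2
# with psycopg2.connect(host=host, port=port, dbname=db,
#                       user=conf["user"], password=conf["password"]) as conn:
#     with conn.cursor() as cur:
#         cur.execute("call store_proc();")
```

This keeps the host, user, and password in the Glue Catalog Connection rather than hard-coding them in the job script.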

Your stored procedure in Redshift should have return values, which can be written out to variables.
