
Error loading Glue ETL job into Snowflake

I am trying to load data from CSV files in S3 buckets into Snowflake using a Glue ETL job. I wrote the following Python script within the ETL job:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from py4j.java_gateway import java_import

    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

    ## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA',
                                         'USERNAME', 'PASSWORD'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    java_import(spark._jvm, "net.snowflake.spark.snowflake")

    spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())

    sfOptions = {
        "sfURL" : args['URL'],
        "sfAccount" : args['ACCOUNT'],
        "sfUser" : args['USERNAME'],
        "sfPassword" : args['PASSWORD'],
        "sfDatabase" : args['DB'],
        "sfSchema" : args['SCHEMA'],
        "sfWarehouse" : args['WAREHOUSE'],
    }

    dyf = glueContext.create_dynamic_frame.from_catalog(database = "salesforcedb", table_name = "pr_summary_csv", transformation_ctx = "dyf")
    df = dyf.toDF()
    ## df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("parallelism", "8").option("dbtable", "abcdef").mode("overwrite").save()
    df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "abcdef").save()
    job.commit()

The error thrown is:

An error occurred while calling o81.save. Incorrect username or password was specified.

However, if I don't convert to a Spark DataFrame and instead use the DynamicFrame directly, I get an error like this:

AttributeError: 'function' object has no attribute 'format'

Could someone please look over my code and tell me what I'm doing wrong when converting the DynamicFrame to a DataFrame? Please let me know if I need to provide more information.

BTW, I am a newbie to Snowflake and this is my first attempt at loading data through AWS Glue. 😊

An error occurred while calling o81.save. Incorrect username or password was specified.

The error message says that there is a problem with the user name or the password. If you are sure that the user name and password are correct, also make sure that the Snowflake account name and URL are correct.
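
For example, here is a minimal sketch of how those two options are usually shaped (the account locator and region below are placeholders, not values from your account):

    # Placeholder values only -- substitute your own account locator and region.
    sfOptions = {
        "sfURL"    : "xy12345.eu-west-1.snowflakecomputing.com",  # full Snowflake host name
        "sfAccount": "xy12345",                                   # account identifier without the domain
        # ... the remaining options (sfUser, sfPassword, sfDatabase, sfSchema, sfWarehouse) as in the script above
    }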

However, if I don't convert to a Spark DataFrame and instead use the DynamicFrame directly, I get an error like this:

AttributeError: 'function' object has no attribute 'format'

The Glue DynamicFrame's write method is different from the Spark DataFrame's, so it is expected that they do not share the same methods. Please check the documentation:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write

It seems you need to pass the parameters as connection_options:

    write(connection_type, connection_options, format, format_options, accumulator_size)

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password", "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
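
Roughly, that would look like the sketch below. This is an illustration only: the connection_type and option values are the placeholder JDBC/Redshift-style values from the snippet above, not a verified Snowflake configuration.

    # Rough sketch only -- placeholder connection_type and options, mirroring the
    # JDBC-style example above rather than a verified Snowflake setup.
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="redshift",              # placeholder JDBC-style target
        connection_options={
            "url": "jdbc-url/database",
            "user": "username",
            "password": "password",
            "dbtable": "table-name",
            "redshiftTmpDir": "s3-tempdir-path",
        },
    )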

Even if you use the DynamicFrame, you will probably still end up with the same incorrect-username-or-password error, so I suggest you focus on fixing the credentials first.

Here is a tested Glue script (you can copy-paste it as is, changing only the table name), which you can use for setting up the Glue ETL job. You will have to add the Snowflake JDBC and Spark connector jars. You can use the link below for the setup: https://community.snowflake.com/s/article/How-To-Use-AWS-Glue-With-Snowflake
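
As a hedged sketch of how those jars can be attached (the bucket paths, jar names, role, and script location below are placeholders, not values from the original post), they can be passed through the job's --extra-jars argument, for example via boto3:

    import boto3

    glue = boto3.client("glue")

    # Placeholder names and paths throughout -- adjust to your own bucket, IAM role, and connector versions.
    glue.create_job(
        Name="s3-csv-to-snowflake",
        Role="MyGlueServiceRole",
        Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/load_to_snowflake.py"},
        DefaultArguments={
            # Comma-separated S3 paths to the Snowflake JDBC driver and Spark connector jars
            "--extra-jars": "s3://my-bucket/jars/snowflake-jdbc.jar,s3://my-bucket/jars/spark-snowflake.jar",
        },
    )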


    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from py4j.java_gateway import java_import

    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

    ## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    ## uj = sc._jvm.net.snowflake.spark.snowflake
    spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())

    sfOptions = {
        "sfURL" : args['URL'],
        "sfAccount" : args['ACCOUNT'],
        "sfUser" : args['USERNAME'],
        "sfPassword" : args['PASSWORD'],
        "sfDatabase" : args['DB'],
        "sfSchema" : args['SCHEMA'],
        "sfWarehouse" : args['WAREHOUSE'],
    }

    ## Read from a Snowflake table into a Spark Data Frame
    df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "Select * from <tablename>").load()
    df.show()

    ## Perform any kind of transformations on your data and save as a new Data Frame:
    df1 = df  # placeholder: insert any filter, transformation, or other operation here

    ## Write the Data Frame contents back to Snowflake in a new table
    df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()
    job.commit()
