Loading data from Glue to Snowflake

I am trying to run an ETL job on AWS Glue that extracts data from MongoDB into a Spark DataFrame and loads it into Snowflake.

This is a sample schema of the Spark DataFrame:

 |-- login: struct (nullable = true)
 |    |-- login_attempts: integer (nullable = true)
 |    |-- last_attempt: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- notifications: struct (nullable = true)
 |    |-- bot_review_queue: boolean (nullable = true)
 |    |-- bot_review_queue_web_push: boolean (nullable = true)
 |    |-- bot_review_queue_web_push_admin: boolean (nullable = true)
 |    |-- weekly_account_summary: struct (nullable = true)
 |    |    |-- enabled: boolean (nullable = true)
 |    |-- weekly_summary: struct (nullable = true)
 |    |    |-- enabled: boolean (nullable = true)
 |    |    |-- day: integer (nullable = true)
 |    |    |-- hour: integer (nullable = true)
 |    |    |-- minute: integer (nullable = true)
 |-- query: struct (nullable = true)
 |    |-- email_address: string (nullable = true)

I am trying to load the data into Snowflake as is, with the struct columns stored as JSON payloads, but it throws the following error:

An error occurred while calling o81.collectToPython. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType

I also tried casting the struct columns to strings before loading, but that throws more or less the same error:

An error occurred while calling o106.save.  com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType

I would really appreciate some help with this.

The code below does the casting and loading:

from pyspark.sql.types import StringType

dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
                                                  connection_options=read_mongo_options)
user_df = dynamic_frame.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
user_df_cast = user_df.select(user_df.login.cast(StringType()),'name',user_df.notifications.cast(StringType()))
datasinkusers = user_df_cast.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "users").mode("append").save()

If your users table in Snowflake has the following schema, then casting is not required, as the StructType fields of a SparkSQL DataFrame will map to the VARIANT type in Snowflake automatically:

CREATE TABLE users (
    login VARIANT
   ,name STRING
   ,notifications VARIANT
   ,query VARIANT
)
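
As a rough illustration of that mapping, the sketch below derives the same DDL from the top-level field types of the schema in the question. The helper name `build_create_table` and the small type table are made up for this example and cover only the two types that appear at the top level here; neither is part of the Snowflake connector.

```python
# Hypothetical lookup: top-level Spark type names to Snowflake column types.
# Only the types seen at the top level of the question's schema are covered.
SPARK_TO_SNOWFLAKE = {
    "struct": "VARIANT",
    "string": "STRING",
}

def build_create_table(table, fields):
    """Return a CREATE TABLE statement for (name, spark_type) pairs."""
    columns = []
    for name, spark_type in fields:
        if spark_type not in SPARK_TO_SNOWFLAKE:
            raise ValueError(f"no mapping in this sketch for type {spark_type!r}")
        columns.append(f"{name} {SPARK_TO_SNOWFLAKE[spark_type]}")
    return f"CREATE TABLE {table} (\n    " + "\n   ,".join(columns) + "\n)"

# Top-level fields taken from the schema printed in the question.
top_level_fields = [
    ("login", "struct"),
    ("name", "string"),
    ("notifications", "struct"),
    ("query", "struct"),
]
ddl = build_create_table("users", top_level_fields)
print(ddl)
```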

Just do the following. No transformations are required, because the Snowflake Spark Connector understands the data types and will convert them to appropriate JSON representations on its own:

user_df = glueContext.create_dynamic_frame.from_options(
  connection_type="mongodb",
  connection_options=read_mongo_options
)

# Parentheses are needed so Python accepts the leading-dot method chain.
(
  user_df
  .toDF()
  .write
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(**sfOptions)
  .option("dbtable", "users")
  .mode("append")
  .save()
)
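
The exceptions quoted in the question (`Cannot cast ARRAY into a StructType`, `Cannot cast STRING into a StructType`) suggest that some documents in the collection hold an array or a string in a field that the inferred schema treats as a struct. If that is the case, the write above would likely still fail when the data is read. A minimal sketch for finding such documents, using plain Python dictionaries in place of MongoDB documents; `find_non_struct_values` and the sample documents are hypothetical:

```python
def find_non_struct_values(documents, struct_fields):
    """Yield (index, field, type_name) where a struct field holds a non-dict value."""
    for index, doc in enumerate(documents):
        for field in struct_fields:
            value = doc.get(field)
            # None is fine: the schema marks these fields as nullable.
            if value is not None and not isinstance(value, dict):
                yield index, field, type(value).__name__

# Made-up sample documents, not real data from the collection.
sample_docs = [
    {"name": "a", "login": {"login_attempts": 1}, "notifications": {}},
    {"name": "b", "login": [], "notifications": {"bot_review_queue": True}},
    {"name": "c", "login": {"login_attempts": 0}, "notifications": "none"},
]

mismatches = list(find_non_struct_values(sample_docs, ["login", "notifications", "query"]))
for index, field, type_name in mismatches:
    print(f"document {index}: field {field!r} is a {type_name}, expected a struct")
```

The same check could be run over documents fetched from the collection itself, so the offending records can be cleaned up or filtered before the Glue job reads them.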

If you absolutely need to store the StructType fields as plain JSON strings, you'll need to explicitly transform them using the to_json SparkSQL function:

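from pyspark.sql.functions import to_json

# user_df is the DynamicFrame created above; convert it to a Spark DataFrame first
spark_df = user_df.toDF()

# Alias each converted column so its name matches the Snowflake table column
user_df_cast = spark_df.select(
  to_json(spark_df.login).alias("login"),
  spark_df.name,
  to_json(spark_df.notifications).alias("notifications")
)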

This will store the JSON strings as simple VARCHAR values, which will not let you use Snowflake's semi-structured data storage and querying capabilities directly; every query would need a PARSE_JSON step first, which is inefficient.
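
The difference is easiest to see outside Snowflake. In the sketch below, plain Python stands in for both sides: `json.dumps` plays the role of `to_json`, and `json.loads` plays the role of `PARSE_JSON`. The sample value is invented for illustration.

```python
import json

# Invented sample value shaped like the `login` struct from the schema.
login = {"login_attempts": 3, "last_attempt": "2021-01-01T00:00:00"}

# Stored as a string (the VARCHAR case): the fields are just characters.
login_as_text = json.dumps(login)
print(type(login_as_text).__name__)

# To read a field, the text has to be parsed again on every access.
attempts = json.loads(login_as_text)["login_attempts"]
print(attempts)
```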

Consider using the VARIANT approach shown above, which will allow you to query the fields directly:

SELECT
    login:login_attempts
   ,login:last_attempt
   ,name
   ,notifications:weekly_summary.enabled
FROM users
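
The path syntax in that query follows one pattern: a colon after the column name, then dots between nested keys. A small sketch that builds such expressions from a column name and a key path; `variant_path` is a hypothetical helper and does no quoting or escaping of key names:

```python
def variant_path(column, *keys):
    """Build a Snowflake path expression such as login:login_attempts."""
    if not keys:
        return column  # a plain column, e.g. name
    return f"{column}:{'.'.join(keys)}"

select_list = [
    variant_path("login", "login_attempts"),
    variant_path("login", "last_attempt"),
    variant_path("name"),
    variant_path("notifications", "weekly_summary", "enabled"),
]
# Assemble the same statement as the hand-written query.
query = "SELECT\n    " + "\n   ,".join(select_list) + "\nFROM users"
print(query)
```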
