AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column, but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table, I get an error.

How can I map a JSON file source to a JSONB target column?

It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:

alter table postgres_table
 alter column column_with_json set data type jsonb using column_with_json::jsonb;

Note the USING clause, which casts the existing text data to jsonb. Without it, the ALTER COLUMN would fail.
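
Because a single value that does not parse as JSON makes the whole ALTER fail, it can help to pre-check the text values first. The following is a minimal sketch in plain Python, not part of the original answer: the function name `find_invalid_json` and the `(id, text)` row shape are assumptions for illustration. Python's parser is not identical to PostgreSQL's, so treat it as a rough filter only.

```python
import json


def _reject_constant(name):
    # json.loads accepts NaN/Infinity by default; PostgreSQL's jsonb does not.
    raise ValueError(f"non-standard JSON constant: {name}")


def find_invalid_json(rows):
    """Return the ids of rows whose text value does not parse as JSON.

    `rows` is an iterable of (id, text) pairs; this shape is hypothetical.
    NULL (None) values are skipped because a NULL casts to a NULL jsonb.
    """
    bad_ids = []
    for row_id, text in rows:
        if text is None:
            continue
        try:
            json.loads(text, parse_constant=_reject_constant)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            bad_ids.append(row_id)
    return bad_ids


sample_rows = [
    (1, '{"firstName": "Sergii", "age": 99}'),
    (2, '{"firstName": "Broken", '),  # truncated document
    (3, None),                        # NULL column value
    (4, '{"score": NaN}'),            # valid for Python's default parser, rejected here
]
print(find_invalid_json(sample_rows))  # prints [2, 4]
```

Rows reported by such a check could be fixed or set to NULL before running the ALTER statement.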

The crawler will identify the JSONB column type as "string", but you can try using the Unbox class in Glue to convert this column to JSON.

Let's look at the following table in PostgreSQL:

create table persons (id integer, person_data jsonb, creation_date timestamp )

Here is an example of one record from the persons table:

ID = 1
PERSON_DATA = {
               "firstName": "Sergii",
               "age": 99,
               "email":"Test@test.com"
               }
CREATION_DATE = 2021-04-15 00:18:06

The following code needs to be added to the Glue job:

from awsglue.context import GlueContext
from awsglue.transforms import Unbox
from pyspark.context import SparkContext

# Usually already present in a generated Glue job script
glueContext = GlueContext(SparkContext.getOrCreate())

# 1. Create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database="testdb", table_name="persons", transformation_ctx="df_persons")
# 2. In path, give the name of the jsonb column that needs to be converted to JSON
df_persons_json = Unbox.apply(frame=df_persons, path="person_data", format="json")
# 3. Convert from dynamic frame to data frame
datf_persons_json = df_persons_json.toDF()

# 4. After that you can process this column as JSON, or build a data frame with only the columns you need; each JSON element can be selected as a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
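
To make steps 2 and 4 concrete without a Glue environment, here is a plain-Python sketch of the same idea: parse the string field into a nested structure, then pull individual keys out as flat columns. It is only an illustration. `unbox_field` and `select_columns` are hypothetical helpers, not Glue APIs, and the record mirrors the sample row above.

```python
import json


def unbox_field(record, path):
    """Return a copy of `record` with the JSON string at `path` parsed."""
    unboxed = dict(record)
    unboxed[path] = json.loads(record[path])
    return unboxed


def select_columns(record, columns):
    """Flatten dotted column names such as "person_data.age" into a flat dict.

    The output key is the last path segment, similar to how Spark names
    a column selected from a struct.
    """
    row = {}
    for column in columns:
        value = record
        for part in column.split("."):
            value = value[part]
        row[column.split(".")[-1]] = value
    return row


# What the crawler-typed "string" column looks like when read.
raw = {
    "id": 1,
    "person_data": '{"firstName": "Sergii", "age": 99, "email": "Test@test.com"}',
    "creation_date": "2021-04-15 00:18:06",
}

unboxed = unbox_field(raw, "person_data")
flat = select_columns(
    unboxed, ["id", "person_data.age", "person_data.firstName", "creation_date"]
)
print(flat)
```

In the Glue job itself, Unbox and the DataFrame select do this work across the whole table; the sketch only shows the shape of the data before and after.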

You can also check the following link:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
