
AWS Glue denest postgres jsonb column

I would like to flatten a jsonb column into multiple target columns in the same table. I cannot find a built-in function to accomplish this. The Glue crawler registers the jsonb column as a string. I can use Unbox.apply() to change this to a struct when I land the data on S3.

I have tried using Relationalize and UnnestFrame to denest the jsonb column. Neither works. Relationalize seems to apply only to .json files. I am not sure why UnnestFrame doesn't work.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mycatalogdb", table_name = "sourcedb_public_tablename", transformation_ctx = "datasource0")

dfc = UnnestFrame.apply(frame = datasource0, transformation_ctx = "dfc", info="", stageThreshold=0, totalThreshold=0)

dropnullfields3 = DropNullFields.apply(frame = dfc, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://mybucket"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Given a source table with the following row:


+----+------------+---------------------------------------------------------+
| id |    date    |                         myjson                          |
+----+------------+---------------------------------------------------------+
|  1 | 2019-10-10 | {"url":"some-url","data":{"afield":123,"moredata":567}} |
+----+------------+---------------------------------------------------------+

I would like this output (the column name format doesn't matter as much as the tabular layout):

+----+------------+----------+-------------+---------------+
| id |    date    |   url    | data_afield | data_moredata |
+----+------------+----------+-------------+---------------+
|  1 | 2019-10-10 | some-url |         123 |           567 |
+----+------------+----------+-------------+---------------+
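To make the target shape concrete, here is a minimal plain-Python sketch of the mapping from the sample row to the desired columns. It does not use any Glue API; the helper names `flatten` and `denest_row` are made up for illustration, and the JSON literal is a well-formed version of the sample value above.

```python
import json


def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts into one level, joining key names with sep."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent name along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def denest_row(row, json_column):
    """Return a copy of row with json_column replaced by its flattened fields."""
    out = {k: v for k, v in row.items() if k != json_column}
    out.update(flatten(json.loads(row[json_column])))
    return out


row = {
    "id": 1,
    "date": "2019-10-10",
    "myjson": '{"url":"some-url","data":{"afield":123,"moredata":567}}',
}
print(denest_row(row, "myjson"))
```

The result has the keys id, date, url, data_afield and data_moredata, which matches the table above. The Glue job should produce the same shape at scale.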

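I eventually figured out that I was using Relationalize incorrectly, and Glue was not throwing an error. I worked this out after running the code interactively in SageMaker and realizing, while reading this post, that relationalize() returns a collection of frames (a DynamicFrameCollection) rather than a single DynamicFrame, so the flattened frame has to be selected from that collection before it is passed on.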

Relationalize can be used on a data frame containing JSON fields. Put another way, the data frame does not have to come from pure JSON.
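A minimal sketch of what that looks like in a Glue job, assuming the jsonb column arrives as a string and is first parsed with Unbox. The helper name `denest_json_column` and the root-table name "root" are my own choices for illustration, not something from the original job.

```python
def denest_json_column(frame, json_column, staging_path):
    """Parse a JSON string column and return the flattened root DynamicFrame."""
    # Imported lazily so the helper can be defined outside a Glue runtime.
    from awsglue.transforms import Relationalize, Unbox

    # Turn the string column into a struct so there is something to flatten.
    unboxed = Unbox.apply(frame=frame, path=json_column, format="json")

    # Relationalize returns a DynamicFrameCollection, not a DynamicFrame.
    collection = Relationalize.apply(
        frame=unboxed, staging_path=staging_path, name="root"
    )

    # The flattened top-level table is stored under the name given above;
    # nested arrays, if present, end up in additional frames of the collection.
    return collection.select("root")
```

In the job from the question this would be called as `denest_json_column(datasource0, "myjson", "s3://mybucket/tmp/")`, where the staging path is a hypothetical scratch location, and the returned frame would be handed to DropNullFields in place of the UnnestFrame result. As far as I can tell, the flattened columns come out with dotted names such as myjson.data.afield, so rename them if you need the underscore style.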
