
How to use AWS Kinesis Firehose to push nested structure to Redshift

We are using Kinesis Firehose to push data to S3 and to Redshift. We push the whole object to S3 and only a subset of its fields to Redshift.

Here is an example of the object we are currently pushing to Firehose.

[
  {
    "field1": 1,
    "field2": 1,
    "arr": [
      {"inner_field1": 1, "inner_field2": 1},
      {"inner_field1": 1, "inner_field2": 1}
    ]
  },
  ...
]

Right now only field1 and field2 are pushed to Redshift, but we also want to push the arr field to Redshift.

The first option we considered is the new SUPER type, but I couldn't find any documentation on how to push a SUPER-typed object from Firehose to Redshift.

The second option (preferred in our case) is to flatten the structure before pushing it to Redshift.

So, using the example object above, we want a table with 4 columns (field1, field2, inner_field1, inner_field2), and the example object would produce 2 rows.
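For the flattening option, one possible approach is a Firehose data-transformation Lambda that expands each element of arr into its own row. The sketch below is a minimal, untested-in-production illustration, not a confirmed solution from this thread. It assumes that transformation is enabled on the delivery stream, that each incoming record is either a single object or an array of objects shaped like the example, and that the target table has the four flat columns described above. The helper name flatten_record and the array_key parameter are hypothetical.

```python
import base64
import json


def flatten_record(record, array_key="arr"):
    """Return one flat dict per element of record[array_key] (hypothetical helper)."""
    # Copy every top-level field except the nested array.
    parent = {k: v for k, v in record.items() if k != array_key}
    children = record.get(array_key) or []
    # Assumption: a record with an empty or missing array produces no rows.
    return [{**parent, **child} for child in children]


def lambda_handler(event, context):
    """Firehose transformation handler: one input record in, N JSON lines out."""
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        # Accept either a single object or an array of objects.
        objects = payload if isinstance(payload, list) else [payload]
        rows = [row for obj in objects for row in flatten_record(obj)]
        # Newline-delimited JSON, so each row is a separate object for COPY.
        data = "".join(json.dumps(row) + "\n" for row in rows)
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok" if rows else "Dropped",
            "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

With the example object, flatten_record returns two dicts, each carrying field1 and field2 alongside one pair of inner_field1 and inner_field2. Note that a transformation like this also changes what Firehose writes to the intermediate S3 location, so keeping the full object in S3 would need a separate path such as source record backup.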

Assuming your table format is:

CREATE TABLE super_test (
    field1 INTEGER,
    field2 INTEGER,
    arr SUPER
);

I ended up finding success with the "Copying a JSON document into multiple SUPER data columns" approach, using a jsonpaths file as described on this page: https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html

In my case, I have a JSON sub-object rather than an 'arr' array element, but I would think the solution would be the same since both are valid JSON constructs.

My COPY options in Kinesis Firehose are similar to:

format as json 's3://<bucket-name>/schema/kinesis-schema.json'

The AWS examples omit the "as" in "format as json" above. The COPY syntax lists AS as an optional keyword, so either form should be accepted; I can only confirm that it works for me with "as" included.

Here is the full COPY statement reported by Firehose:

COPY super_test FROM 's3://<bucket-name>/<manifest>' CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' MANIFEST format as json 's3://<bucket-name>/schema/kinesis-schema.json';

where kinesis-schema.json would have the following format based on your field names:

{
    "jsonpaths": [
        "$.field1",
        "$.field2",
        "$.arr"
    ]
}
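Because the entries in the jsonpaths array are matched to table columns by position, it can help to generate the file from the column list rather than write it by hand. The following is a small sketch under that assumption; build_jsonpaths is a hypothetical helper name, and it only handles top-level fields using the same dot notation as the file above.

```python
import json


def build_jsonpaths(columns):
    """Build a jsonpaths document with one top-level path per column.

    The order of `columns` must match the column order of the target
    table (or of the column list given to COPY).
    """
    return {"jsonpaths": [f"$.{name}" for name in columns]}


if __name__ == "__main__":
    # Column order taken from the super_test table definition above.
    doc = build_jsonpaths(["field1", "field2", "arr"])
    print(json.dumps(doc, indent=4))
```

The printed document can then be uploaded to the S3 location referenced in the COPY options.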

This is what works for me, at least. Hopefully it points you in the right direction.

