
Unload data from Redshift to S3 in a specific partition format

I have a large table that I need to unload to S3, partitioned by year and month, so that the following folder structure is achieved:

bucket_name/folder_name/year/month/

As of now I'm doing this:

unload ('select *, extract(year from question_asked_at) as year, extract(month from question_asked_at) as month from schema_name.table_name')
to 's3://bucket_name/folder_name/'
iam_role '<iam_role>'
partition by (year, month);

The results are: bucket_name/folder_name/year=2001/month=01/, bucket_name/folder_name/year=2001/month=02/

The partitioning works, but I need to remove the year= and month= prefixes. Any suggestions?

The partition_column=value format is a convention established by Hive, and Redshift UNLOAD follows that convention (see the Redshift manual for UNLOAD).
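
For illustration, the convention encodes each partition column and its value directly in the object key. Here is a minimal Python sketch (the key below is a hypothetical example) that parses such a key back into its partition values:

import re

# Hive-style key as written by UNLOAD (hypothetical example)
key = "folder_name/year=2001/month=01/0000_part_00"

# Pull out the column=value pairs -> {'year': '2001', 'month': '01'}
partitions = dict(re.findall(r"(\w+)=([^/]+)", key))
print(partitions)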

I think that to achieve your goal, you'd need to move the files to new prefixes (without year= and month=) as a separate step, using e.g. bash or Python and some regex magic.


I tried to sketch out how to do that with boto3, and this is what I came up with:

import boto3
import re

s3 = boto3.resource("s3")

# Bucket and prefix where UNLOAD wrote the Hive-style partitions
bucket_name = "sbochniak-zalon-eu-central-1"
prefix = "firehose_zalon_backend_events/"

# List every object key under the prefix
keys = [
    o.key
    for o in
    s3.Bucket(bucket_name).objects.filter(Prefix=prefix).all()
]

# Strip the "year=" and "month=" markers, keeping only the values
new_keys = [
    re.sub(r'^(.*)year=(\w+)(.*)month=(\w+)(.*)$', r'\1\2\3\4\5', k)
    for k in
    keys
]

# S3 has no rename, so copy each object to its new key and delete the old one
for old_key, new_key in zip(keys, new_keys):
    if new_key == old_key:
        continue  # key had no year=/month= parts, nothing to do
    s3.Object(bucket_name, new_key).copy_from(CopySource={"Bucket": bucket_name, "Key": old_key})
    s3.Object(bucket_name, old_key).delete()
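
Before running the copy/delete loop, it can be useful to print the planned renames and verify the regex does what you expect; a minimal sketch reusing the keys and new_keys lists from above:

for old_key, new_key in zip(keys, new_keys):
    print(old_key, "->", new_key)

The copy-then-delete approach is needed because S3 has no rename operation. Note that boto3's object collections handle pagination automatically, so the listing works even with more than 1000 objects under the prefix.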
