
Filtering data loaded into Redshift

We have raw data stored in S3 as Parquet. I want a subset of that data loaded into Redshift. To be clear, the Redshift data would be the result of a query (joins, filters, aggregations) over the raw data.

I originally thought that I could build views in Athena and load the results into Redshift - but it seems it's not that simple!

Glue ETL jobs need an S3 or RDS source - they will not accept an Athena view as input. (The crawler cannot crawl a view either.)

The next solution was to have a play with the Athena CTAS functionality: write the results of the view to S3, then load that into Redshift. However, CTAS has no 'overwrite' option.
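For reference, the kind of custom script I'm afraid this is heading towards looks roughly like the following. It's a sketch only: the bucket, prefix, database, workgroup, table name and the final SELECT are all placeholders I've made up, and it assumes boto3 credentials are already configured.

```python
# Manual "overwrite" for Athena CTAS: drop the table, clear the old output,
# then re-run the CTAS. All names below are hypothetical.
import time

import boto3

s3 = boto3.resource("s3")
athena = boto3.client("athena")

BUCKET = "my-bucket"
PREFIX = "curated/my_subset/"
DATABASE = "my_db"

def run_query(sql):
    """Submit a query to Athena and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        WorkGroup="primary",
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            if state != "SUCCEEDED":
                raise RuntimeError(f"query {qid} finished in state {state}")
            return
        time.sleep(2)

# 1. Drop the old table (metadata only; the data files stay in S3).
run_query("DROP TABLE IF EXISTS my_subset")

# 2. Clear the old output, since CTAS refuses a non-empty location.
s3.Bucket(BUCKET).objects.filter(Prefix=PREFIX).delete()

# 3. Recreate the table from the view's query.
run_query(f"""
    CREATE TABLE my_subset
    WITH (external_location = 's3://{BUCKET}/{PREFIX}', format = 'PARQUET')
    AS SELECT id, SUM(amount) AS total  -- hypothetical joins/filters/aggs
       FROM raw_events
       GROUP BY id
""")
```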

So, questions... Is there an easier way to approach this? (It seems such a simple requirement.) Is there an easy workaround to get 'overwrite' behaviour with CTAS? Either way, it would have to be something that can be bundled into a scheduled job - and I think that's already leading towards a custom script like the sketch above.

When such a simple job becomes this difficult, I cannot help but think I'm missing something obvious!?

Thanks

Ol' reliable: use a Lambda! Lambda functions can programmatically connect to both S3 and Redshift to execute SQL statements, and you have many options for what triggers the Lambda (if it's a recurring job, you can simply run it on a schedule). You will also be able to use CloudWatch Logs to examine the process.
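A minimal sketch of that kind of Lambda, using the Redshift Data API so you don't need drivers or VPC wiring inside the function. The cluster, database, user, table, S3 path and IAM role are placeholders:

```python
# Scheduled Lambda handler: truncate the target table, then COPY the latest
# CTAS output from S3. All identifiers are placeholders.
import boto3

client = boto3.client("redshift-data")

def handler(event, context):
    # batch_execute_statement runs the statements in order, as one batch.
    # Note TRUNCATE commits immediately in Redshift; use DELETE FROM instead
    # if you need the reload to be atomic.
    resp = client.batch_execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sqls=[
            "TRUNCATE TABLE my_subset;",
            """
            COPY my_subset
            FROM 's3://my-bucket/curated/my_subset/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            FORMAT AS PARQUET;
            """,
        ],
    )
    return resp["Id"]
```

Hook that up to an EventBridge schedule and the refresh becomes a two-step pipeline: the Athena CTAS rebuilds the S3 output, then this handler reloads Redshift.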

But beware: I noticed that your data is stored as Parquet. Redshift's COPY can load flat Parquet files (FORMAT AS PARQUET), but it does not handle nested types such as structs. So, if you want to keep types like structs, you will need Redshift Spectrum, which queries the Parquet in place.
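For the Spectrum route you don't even need COPY: point an external schema at the Glue catalog and materialise the subset with plain SQL inside Redshift. A rough sketch, again with placeholder names:

```python
# Build the subset via Spectrum instead of COPY. All names and the IAM role
# are placeholders; the Data API submits statements asynchronously, so the
# one-time setup should really be run separately before the scheduled part.
import boto3

client = boto3.client("redshift-data")
CLUSTER = dict(ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser")

# One-time setup: expose the Glue catalog database as an external schema,
# making the raw Parquet in S3 directly queryable from Redshift.
client.execute_statement(
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS raw_s3
        FROM DATA CATALOG DATABASE 'my_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';
    """,
    **CLUSTER,
)

# Scheduled rebuild: drop and recreate the local subset from the raw data.
client.batch_execute_statement(
    Sqls=[
        "DROP TABLE IF EXISTS my_subset;",
        """
        CREATE TABLE my_subset AS
        SELECT id, SUM(amount) AS total  -- hypothetical joins/filters/aggs
        FROM raw_s3.raw_events
        GROUP BY id;
        """,
    ],
    **CLUSTER,
)
```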
