
Filtering data loaded into Redshift

We have raw data stored in S3 as Parquet. I want a subset of that data loaded into Redshift. To be clear, the Redshift data would be the result of a query (joins, filters, aggregations) over the raw data.

I originally thought that I could build views in Athena and load the results into Redshift - but it seems it's not that simple!

Glue ETL jobs need an S3 or RDS source - they will not accept an Athena view as input. (The crawler cannot crawl a view either.)

The next solution was to have a play with the Athena CTAS functionality: write the results of the view to S3, then load that into Redshift. However, CTAS has no 'overwrite' option.
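For reference, the kind of custom script I'm afraid this is heading towards looks roughly like the following. It's a sketch only: the bucket, prefix, database, workgroup, table name and the final SELECT are all placeholders I've made up, and it assumes boto3 credentials are already configured.

```python
# Manual "overwrite" for Athena CTAS: drop the table, clear the old output,
# then re-run the CTAS. All names below are hypothetical.
import time

import boto3

s3 = boto3.resource("s3")
athena = boto3.client("athena")

BUCKET = "my-bucket"
PREFIX = "curated/my_subset/"
DATABASE = "my_db"

def run_query(sql):
    """Submit a query to Athena and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        WorkGroup="primary",
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            if state != "SUCCEEDED":
                raise RuntimeError(f"query {qid} finished in state {state}")
            return
        time.sleep(2)

# 1. Drop the old table (metadata only; the data files stay in S3).
run_query("DROP TABLE IF EXISTS my_subset")

# 2. Clear the old output, since CTAS refuses a non-empty location.
s3.Bucket(BUCKET).objects.filter(Prefix=PREFIX).delete()

# 3. Recreate the table from the view's query.
run_query(f"""
    CREATE TABLE my_subset
    WITH (external_location = 's3://{BUCKET}/{PREFIX}', format = 'PARQUET')
    AS SELECT id, SUM(amount) AS total  -- hypothetical joins/filters/aggs
       FROM raw_events
       GROUP BY id
""")
```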

So, questions... Is there an easier way to approach this? (It seems such a simple requirement.) Is there an easy workaround to get 'overwrite' behaviour with CTAS? Either way, it would have to be something that can be bundled into a scheduled job - and I think that's already leading towards a custom script like the sketch above.

When such a simple job becomes this difficult, I cannot help but think I'm missing something obvious!?

Thanks

Ol' reliable: use a Lambda! Lambda functions can programmatically connect to both S3 and Redshift to execute SQL statements, and you have many options for what triggers the Lambda (if it's a recurring job, you can simply run it on a schedule). You will also be able to use CloudWatch Logs to examine the process.
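A minimal sketch of that kind of Lambda, using the Redshift Data API so you don't need drivers or VPC wiring inside the function. The cluster, database, user, table, S3 path and IAM role are placeholders:

```python
# Scheduled Lambda handler: truncate the target table, then COPY the latest
# CTAS output from S3. All identifiers are placeholders.
import boto3

client = boto3.client("redshift-data")

def handler(event, context):
    # batch_execute_statement runs the statements in order, as one batch.
    # Note TRUNCATE commits immediately in Redshift; use DELETE FROM instead
    # if you need the reload to be atomic.
    resp = client.batch_execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sqls=[
            "TRUNCATE TABLE my_subset;",
            """
            COPY my_subset
            FROM 's3://my-bucket/curated/my_subset/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            FORMAT AS PARQUET;
            """,
        ],
    )
    return resp["Id"]
```

Hook that up to an EventBridge schedule and the refresh becomes a two-step pipeline: the Athena CTAS rebuilds the S3 output, then this handler reloads Redshift.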

But beware: I noticed that your data is stored as Parquet. Redshift's COPY can load flat Parquet files (FORMAT AS PARQUET), but it does not handle nested types such as structs. So, if you want to keep types like structs, you will need Redshift Spectrum, which queries the Parquet in place.
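For the Spectrum route you don't even need COPY: point an external schema at the Glue catalog and materialise the subset with plain SQL inside Redshift. A rough sketch, again with placeholder names:

```python
# Build the subset via Spectrum instead of COPY. All names and the IAM role
# are placeholders; the Data API submits statements asynchronously, so the
# one-time setup should really be run separately before the scheduled part.
import boto3

client = boto3.client("redshift-data")
CLUSTER = dict(ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser")

# One-time setup: expose the Glue catalog database as an external schema,
# making the raw Parquet in S3 directly queryable from Redshift.
client.execute_statement(
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS raw_s3
        FROM DATA CATALOG DATABASE 'my_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';
    """,
    **CLUSTER,
)

# Scheduled rebuild: drop and recreate the local subset from the raw data.
client.batch_execute_statement(
    Sqls=[
        "DROP TABLE IF EXISTS my_subset;",
        """
        CREATE TABLE my_subset AS
        SELECT id, SUM(amount) AS total  -- hypothetical joins/filters/aggs
        FROM raw_s3.raw_events
        GROUP BY id;
        """,
    ],
    **CLUSTER,
)
```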
