Move data from PostgreSQL to AWS S3 and analyze with Redshift Spectrum

I have a large number of PostgreSQL tables with different schemas, and a massive amount of data inside them.

I'm unable to do data analytics right now because the data volume is quite large - a few TB - and PostgreSQL is not able to process queries in a reasonable amount of time.

I'm thinking about the following approach - I'll process all of my PostgreSQL tables with Apache Spark, load them as DataFrames and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored in these Parquet files.
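
Roughly, the extract-and-store step I have in mind would look something like this (just a sketch - the host, credentials, table name, and partition bounds are placeholders, and the PostgreSQL JDBC driver would need to be on the Spark classpath):

    from pyspark.sql import SparkSession

    # Build a Spark session; on EMR, EMRFS lets Spark address S3 directly.
    spark = SparkSession.builder.appName("postgres-to-s3-parquet").getOrCreate()

    # Placeholder connection details - replace with your own.
    jdbc_url = "jdbc:postgresql://my-host:5432/my_database"

    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "public.my_table")
        .option("user", "my_user")
        .option("password", "my_password")
        # Partitioned reads parallelize the extract across executors,
        # which matters for multi-TB tables.
        .option("partitionColumn", "id")
        .option("lowerBound", 1)
        .option("upperBound", 100000000)
        .option("numPartitions", 64)
        .load()
    )

    # Write columnar Parquet to S3 for Spectrum/Athena to query later.
    df.write.mode("overwrite").parquet("s3a://my-bucket/my_table/")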

First of all, I'd like to ask - will this solution work at all?

And the second - will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by AWS Redshift)?

  1. Redshift Spectrum supports pretty much the same data types as Redshift itself.

  2. Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of nodes in your actual Redshift cluster, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.

  3. As you noted in the comments, you can use Athena to query the data, and it would be a better option in your case than Spectrum. But Athena has several limitations, like a 30-minute run time limit, memory constraints, etc. So if you plan to run complicated queries with several joins, it may simply not work.

  4. Redshift Spectrum can't create external tables without a provided structure; you have to declare the columns yourself (see the DDL sketch just below this list).

  5. The best solution in your case would be to use Spark (on EMR, or Glue) to transform the data and Athena to query it, and if Athena can't handle a specific query, fall back to Spark SQL on the same data (see the Athena example below the list). You can use Glue, but running jobs on EMR on Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which lets you use S3 almost transparently instead of HDFS.
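
To make point 4 concrete, here is a hedged sketch of the explicit DDL Spectrum requires, submitted via the Redshift Data API from boto3 (the cluster, database, user, IAM role, columns, and S3 path are all assumptions - you can just as well run the same SQL from any Redshift client):

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # One-time setup: an external schema backed by the Glue Data Catalog.
    create_schema = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """

    # The external table itself: every column must be declared explicitly -
    # Spectrum will not infer the schema from the Parquet files.
    create_table = """
    CREATE EXTERNAL TABLE spectrum.my_table (
        id      BIGINT,
        name    VARCHAR(256),
        created TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/my_table/';
    """

    for sql in (create_schema, create_table):
        client.execute_statement(
            ClusterIdentifier="my-cluster",
            Database="dev",
            DbUser="admin",
            Sql=sql,
        )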
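
And for point 5, querying the same Parquet data from Python through Athena could look roughly like this (the database name, table name, and results bucket are assumptions; the table itself would come from a Glue crawler like the one described in the next answer):

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Kick off the query; Athena writes results to the given S3 location.
    response = athena.start_query_execution(
        QueryString="SELECT name, count(*) AS n FROM my_table GROUP BY name",
        QueryExecutionContext={"Database": "spectrum_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:  # the first row is the column header
            print([col.get("VarCharValue") for col in row["Data"]])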

AWS Glue might be an interesting option for you. It is both a hosted version of Spark with some AWS-specific add-ons, and a data crawler + data catalogue.

It can crawl data files such as Parquet and figure out their structure, which then allows you to export the data to AWS Redshift in structured form if needed.
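
As a rough sketch, setting up such a crawler with boto3 could look like this (the crawler name, IAM role, database, and S3 path are placeholders; the role needs Glue and S3 read permissions):

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Register a crawler that infers table structure from the Parquet files.
    glue.create_crawler(
        Name="my-parquet-crawler",
        Role="arn:aws:iam::123456789012:role/MyGlueRole",
        DatabaseName="spectrum_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/my_table/"}]},
    )

    # Run it once; the inferred table lands in the Glue Data Catalog, where
    # Athena (and Redshift Spectrum, via an external schema) can see it.
    glue.start_crawler(Name="my-parquet-crawler")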

See this blog post on how to connect it to a Postgres database using JDBC to move data from Postgres to S3.
