
How to pipe data from AWS Postgres RDS to S3 (then Redshift)?

I'm using the AWS Data Pipeline service to pipe data from an RDS MySQL database to S3 and then on to Redshift, which works nicely.

However, I also have data living in an RDS Postgres instance which I would like to pipe the same way, but I'm having a hard time setting up the JDBC connection. If this is unsupported, is there a work-around?

"connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB”

Nowadays you can define a Copy activity to extract data from a Postgres RDS instance into S3. In the Data Pipeline interface (a sketch of a matching pipeline definition follows the list):

  1. Create a data node of the type SqlDataNode. Specify the table name and the select query.
  2. Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
  3. Create a data node of the type S3DataNode.
  4. Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
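For reference, here is a minimal sketch of what such a pipeline definition could look like in JSON. The object types follow from the steps above, but the IDs, field names and values shown are illustrative assumptions, so verify them against the Data Pipeline documentation:

{
  "objects": [
    {
      "id": "MyRdsDatabase",
      "type": "RdsDatabase",
      "rdsInstanceId": "your-instance-id",
      "databaseName": "THE_DB",
      "username": "USER",
      "*password": "PASSWORD"
    },
    {
      "id": "MySqlDataNode",
      "type": "SqlDataNode",
      "database": { "ref": "MyRdsDatabase" },
      "table": "blahs",
      "selectQuery": "select * from blahs"
    },
    {
      "id": "MyS3DataNode",
      "type": "S3DataNode",
      "directoryPath": "s3://my-bucket/postgres-export/"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "MySqlDataNode" },
      "output": { "ref": "MyS3DataNode" }
    }
  ]
}

A complete definition would typically also need a schedule attached to the activity.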

This doesn't work yet. AWS hasn't built/released the functionality to connect nicely to Postgres. You can do it with a ShellCommandActivity, though. You can write a little Ruby or Python code to do it and drop that in a script on S3 using scriptUri. You could also just write a psql command to dump the table to a CSV and then pipe that to OUTPUT1_STAGING_DIR with "staging: true" in that activity node.

Something like this:

{
  "id": "DumpCommand",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "stage": "true",
  "output": { "ref": "S3ForRedshiftDataNode" },
  "command": "PGPASSWORD=password psql -h HOST -U USER -d DATABASE -p 5432 -t -A -F\",\" -c \"select blah_id from blahs\" > ${OUTPUT1_STAGING_DIR}/my_data.csv"
}

I didn't run this to verify because it's a pain to spin up a pipeline :( so double-check the escaping in the command.

  • Pros: super straightforward and requires no additional script files to upload to S3.
  • Cons: not exactly secure. Your DB password will be transmitted over the wire without encryption.

Look into the new stuff AWS just launched on parameterized templating for data pipelines: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html . It looks like it will allow encryption of arbitrary parameters.
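As a rough illustration of the idea (the field names and the #{*...} reference syntax are my assumptions, so check them against the linked page), a masked parameter for the database password might look like this and replace the hard-coded PGPASSWORD value in the command above:

{
  "parameters": [
    {
      "id": "*myDbPassword",
      "type": "String",
      "description": "Password for the Postgres RDS instance"
    }
  ],
  "values": {
    "*myDbPassword": "secret"
  }
}

The activity's command would then reference it as #{*myDbPassword} instead of a literal password.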

AWS now allows partners to do near-real-time RDS -> Redshift inserts.

https://aws.amazon.com/blogs/aws/fast-easy-free-sync-rds-to-redshift/
