
Synchronize data from MySQL to Amazon Redshift

We run aggregations on huge datasets in Amazon Redshift, and we have a relatively small amount of data in MySQL. Some of the joins in Redshift need the MySQL data. What is the best way to synchronize the MySQL data to Redshift? Does Redshift have anything like the remote view in Oracle? Or should I programmatically query MySQL and insert/update in Redshift?

Redshift now supports loading data from remote hosts via SSH. This technique involves:

  1. Adding the public key from the cluster to the authorized_keys file on the remote host(s)
  2. Allowing SSH access to the remote host(s) from the IP addresses of the cluster nodes
  3. Uploading a JSON manifest to S3 specifying the remote host(s), the public key(s), and the command(s) to execute on the remote host
  4. Running the COPY command with a specified manifest file and AWS credentials

The command specified in the manifest can be any command that prints text output in a format the Redshift COPY command can ingest.
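Steps 3 and 4 above can be sketched in Python as follows. This is a minimal illustration, not a tested deployment: the helper names, host address, key, role ARN, and remote command are all hypothetical, and authorizing COPY with `IAM_ROLE` is an assumption (key-based credentials are also accepted). The manifest fields (`endpoint`, `command`, `mandatory`, `publickey`, `username`) are the ones the SSH manifest format uses.

```python
import json


def build_ssh_manifest(hosts, command, username):
    """Return the JSON text of a COPY-over-SSH manifest.

    hosts maps each remote host address to that host's public key.
    """
    entries = [
        {
            "endpoint": endpoint,
            "command": command,
            "mandatory": True,  # fail the COPY if this host cannot be reached
            "publickey": public_key,
            "username": username,
        }
        for endpoint, public_key in hosts.items()
    ]
    return json.dumps({"entries": entries}, indent=2)


def build_ssh_copy(table, manifest_url, iam_role):
    """Return a COPY statement that reads from the hosts named in the manifest.

    The table name and URL are interpolated directly, so they must come
    from trusted configuration, never from user input.
    """
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '\\t' SSH;"
    )


# Hypothetical example values, for illustration only.
manifest_text = build_ssh_manifest(
    hosts={"203.0.113.10": "ssh-rsa AAAA...example"},
    # mysql --batch prints tab-separated rows; NULLs appear as the text NULL.
    command="mysql --batch --skip-column-names -e 'SELECT * FROM dim' mydb",
    username="ec2-user",
)
copy_sql = build_ssh_copy(
    "dim", "s3://my-bucket/ssh_manifest.json",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The manifest text would then be uploaded to the S3 location named in the COPY statement before the statement is run against the cluster.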

When MySQL data is required for joins in Redshift, we usually just copy it over from MySQL to Redshift.

This involves:

  1. Redshift: Creating an analogous table schema (bearing in mind Redshift/PSQL's particularities)
  2. MySQL: Dumping the data table (in CSV format)
  3. Zipping the export, and sending it to S3
  4. Redshift: Truncating the table, and importing all data using COPY

Steps 2 to 4 can be scripted, which lets you send fresh data over to Redshift on demand or on a regular schedule.
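A minimal sketch of how steps 3 and 4 might be scripted, assuming the CSV export from step 2 already exists on disk. The function names, bucket, key, role ARN, and DSN are hypothetical; boto3 and psycopg2 are one possible choice of client libraries, and `IAM_ROLE` authorization is an assumption.

```python
import gzip
import shutil


def gzip_file(csv_path):
    """Compress csv_path to csv_path + '.gz' and return the new path."""
    gz_path = csv_path + ".gz"
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def reload_statements(table, s3_url, iam_role):
    """Return the SQL for a full reload: empty the table, then COPY the file.

    Values are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return [
        f"TRUNCATE {table};",
        f"COPY {table} FROM '{s3_url}' IAM_ROLE '{iam_role}' FORMAT AS CSV GZIP;",
    ]


def sync_table(csv_path, bucket, key, table, iam_role, dsn):
    """Upload a MySQL CSV export to S3 and reload the Redshift table from it."""
    # Third-party packages, imported lazily so the helpers above
    # can be used and tested without them installed.
    import boto3
    import psycopg2

    gz_path = gzip_file(csv_path)
    boto3.client("s3").upload_file(gz_path, bucket, key)

    conn = psycopg2.connect(dsn)
    try:
        conn.autocommit = True  # TRUNCATE commits implicitly in Redshift
        with conn.cursor() as cur:
            for statement in reload_statements(table, f"s3://{bucket}/{key}", iam_role):
                cur.execute(statement)
    finally:
        conn.close()
```

Calling `sync_table` from a scheduler such as cron would give the regular refresh described above. Because the table is truncated before the COPY, queries that run between the two statements see an empty table.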

What is a "remote view" in Oracle?

Anyway, if you can extract data from a table to a CSV file, you have one more scripting option: you can use the Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.

In my MySQL_To_Redshift_Loader I do the following:

  1. Extract data from MySQL into a temp file.

     from subprocess import Popen, PIPE
     # Command line for the mysql client that will run the export query.
     loadConf = [db_client_dbshell, '-u', opt.mysql_user, '-p%s' % opt.mysql_pwd,
                 '-D', opt.mysql_db_name, '-h', opt.mysql_db_server]
     ...
     q = """
     %s %s
     INTO OUTFILE '%s'
     FIELDS TERMINATED BY '%s'
     ENCLOSED BY '%s'
     LINES TERMINATED BY '\\r\\n';
     """ % (in_qry, limit, out_file, opt.mysql_col_delim, opt.mysql_quote)
     # Pipe the query text into the mysql client.
     p1 = Popen(['echo', q], stdout=PIPE, stderr=PIPE, env=env)
     p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
     ...
  2. Compress and load the data to S3 using the boto Python module and multipart upload.

     import boto
     from boto.s3.key import Key
     conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
     bucket = conn.get_bucket(bucket_name)
     k = Key(bucket)
     k.key = s3_key_name
     k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                              reduced_redundancy=use_rr)
  3. Use a COPY command, run through psycopg2, to append the data to the Redshift table.

     sql = """
     copy %s from '%s'
     CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
     DELIMITER '%s'
     FORMAT CSV %s %s %s %s;
     """ % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
            opt.delim, quote, gzip, timeformat, ignoreheader)

Check this simple way to load MySQL data into Redshift. If you only need to load initial data snapshots into Redshift, try that free solution. You also get schema migration, a side-by-side query console, and a statistical report (with charts) of the entire loading process.
