
AWS Redshift Data Processing

I'm currently working with a small company that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.

The first task requires some basic transformation of existing data in that cluster into some new tables, based on fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?

The other task involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my Python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my Python scripts there, do the processing there as well, and schedule the scripts to run via cron?

I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue / Data Pipeline / EMR), but there are so many that I'm a little overwhelmed. Thanks in advance for the assistance!

ETL

Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)

You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:

`psql <authentication stuff> -c 'insert into z select a, b from x'`

(Use psql v8, upon which Redshift was based.)
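
For the nightly scheduling part of the question, one approach (not spelled out in the original answer) is to put that transformation in a small script and run it from cron on any machine that can reach the cluster. A minimal sketch in Python using psycopg2 (Redshift speaks the PostgreSQL wire protocol); the endpoint, credentials and SQL below are placeholders:

```python
# nightly_transform.py -- a sketch only; host, credentials and SQL are placeholders.
import psycopg2

TRANSFORM_SQL = "insert into z select a, b from x"  # your actual transformation logic

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="mydb",
    user="etl_user",
    password="...",  # better: read from an environment variable or a secrets store
)
try:
    with conn, conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)  # the `with conn` block commits on success
finally:
    conn.close()

# Example crontab entry to run it nightly at 02:00:
#   0 2 * * * /usr/bin/python /opt/etl/nightly_transform.py
```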

Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.

Machine Learning

Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.

Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
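
A minimal sketch of such a Lambda handler using boto3; the AMI, instance type, instance profile and script path are placeholders and not part of the original answer. The user-data script runs the job and shuts the machine down, and `InstanceInitiatedShutdownBehavior='terminate'` turns that shutdown into a terminate:

```python
# Lambda handler sketch: launch a self-terminating EC2 worker on a CloudWatch Events schedule.
import boto3

USER_DATA = """#!/bin/bash
/usr/bin/python /opt/ml/score_records.py   # placeholder: pull from Redshift, score, write back
shutdown -h now                            # with 'terminate' behaviour below, this ends the instance
"""

def handler(event, context):
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                # placeholder AMI with your code installed
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "ml-batch-role"},   # placeholder role with Redshift/S3 access
        InstanceInitiatedShutdownBehavior="terminate",  # shutdown from inside terminates the instance
        UserData=USER_DATA,                             # boto3 base64-encodes this for you
    )
```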

Lots of options, indeed!

The 2 options for running ETL on Redshift

  1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table (see the sketch after this list).
  2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
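
As a rough illustration of option 1 (not part of the original answer), the whole transformation can live in Redshift as a "create table as" statement and be run by the same kind of scheduled script shown earlier; the table and column names below are invented:

```python
# Option 1 sketch: push the transformation into Redshift with CREATE TABLE AS.
import psycopg2

CTAS_SQL = """
drop table if exists daily_customer_spend;
create table daily_customer_spend as
select customer_id, trunc(order_ts) as order_date, sum(amount) as total_spend
from source_orders
group by customer_id, trunc(order_ts);
"""

conn = psycopg2.connect(host="...", port=5439, dbname="mydb", user="etl_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute(CTAS_SQL)  # both statements run in one transaction and commit on success
conn.close()
```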

Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).

Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or where the transformation is likely to take a huge amount of compute resources.

There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full-featured than cron jobs.
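
A minimal Airflow DAG sketch (assuming the Airflow 1.x import path; the DAG id, schedule and the body of run_transform() are placeholders):

```python
# airflow_redshift_etl.py -- a nightly DAG sketch; wire run_transform() to your real logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_transform():
    # Placeholder: open a connection to Redshift (e.g. psycopg2) and execute the
    # transformation SQL, as in the earlier sketches.
    pass

with DAG(
    dag_id="redshift_nightly_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",   # nightly at 02:00
    catchup=False,
) as dag:
    transform = PythonOperator(
        task_id="run_transform",
        python_callable=run_transform,
    )
```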

Basic transformation of existing data

Since you seem to be a Python developer (you said you are developing a Python-based ML model), you can do the transformation by following the steps below:

  1. You can use boto3 ( https://aws.amazon.com/sdk-for-python/ ) in order to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
  2. You can write your own Python functions that mimic stored procedures. Inside these functions, you can put your transformation logic.
  3. Alternatively, you can create Python functions (UDFs) in Redshift as well that act somewhat like stored procedures (see the sketch after this list). See more here: https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/
  4. Finally, you can use Windows Task Scheduler / cron jobs to schedule your Python scripts with parameters, much like a SQL Server Agent job does.
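
As a hedged illustration of step 3, a scalar Python UDF can be registered in Redshift with DDL like the following; the function name and logic are invented for the example, and note that a UDF returns one value per row rather than running multi-statement procedural logic:

```python
# Step 3 sketch: register a scalar Python UDF (plpythonu) in Redshift.
import psycopg2

UDF_DDL = """
create or replace function f_spend_bucket (total_spend float)
returns varchar
immutable
as $$
    if total_spend is None:
        return 'unknown'
    return 'high' if total_spend > 1000 else 'low'
$$ language plpythonu;
"""

conn = psycopg2.connect(host="...", port=5439, dbname="mydb", user="etl_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute(UDF_DDL)
    # The UDF can then be used in ordinary SQL, e.g.:
    #   select customer_id, f_spend_bucket(total_spend) from daily_customer_spend;
conn.close()
```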

Best way to host my Python logic

It seems to me you are reading some data from Redshift, then creating test and training sets, and finally getting some predicted results (records). If so:

  1. Host the script on any of your servers (LAN) and connect to Redshift using boto3. If a large number of rows need to be transferred over the internet, then an EC2 instance in the same region is an option. Start the EC2 instance on an ad-hoc basis, complete your job and stop it; that will be cost effective. You can do this using the AWS SDK. I have done it using the AWS SDK for .NET, and I assume boto3 has the same support.
  2. If your result sets are relatively small, you can save them directly into the target Redshift table.
  3. If the result sets are larger, save them to CSV (there are several Python libraries for this) and load the rows into a staging table using the COPY command if you need any intermediate calculation; if not, load them directly into the target table (see the sketch after this list).
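
A hedged sketch of item 3, assuming boto3, psycopg2, an existing S3 bucket and an IAM role attached to the cluster that can read from it; the bucket, table, file names and role ARN are all placeholders:

```python
# Item 3 sketch: write scored records to CSV, stage them in S3, then COPY into Redshift.
import csv
import boto3
import psycopg2

scored_rows = [(1, 0.87), (2, 0.12)]   # placeholder (record_id, score) pairs from your model

# 1. Write the results to a local CSV file.
with open("/tmp/scores.csv", "w", newline="") as f:
    csv.writer(f).writerows(scored_rows)

# 2. Upload the file to S3.
boto3.client("s3").upload_file("/tmp/scores.csv", "my-etl-bucket", "scores/scores.csv")

# 3. COPY from S3 into a staging (or target) table.
COPY_SQL = """
copy staging_scores (record_id, score)
from 's3://my-etl-bucket/scores/scores.csv'
iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
csv;
"""

conn = psycopg2.connect(host="...", port=5439, dbname="mydb", user="etl_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```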

Hope this helps.
