
How can I preview AWS Glue jobs?

Asked 2019-08-19 09:39:44 · Tags: amazon-web-services, etl, aws-glue

I want to use Glue to extract data from an RDS Postgres database, transform and clean it, and load it into an S3 bucket so I can use Athena and QuickSight to visualize the data and create reports.

I'm currently authoring the Glue job for the data cleanup (removing NULL values and similar things), but I can see no easy way to preview the results of the job script. I can only see the results in the S3 bucket after running the complete job. The job takes at least 10 minutes to start and a few more to finish, so I have a round-trip time of about 15 minutes to see whether my code is correct. Is this supposed to be the workflow here? Am I missing anything?

I'm new to the whole BI/data field, so maybe I'm following the wrong approach. I want to visualize data from RDS in QuickSight and need to do some data cleanup first. Are there other approaches that make sense for this scenario? (We are talking about a small dataset of a few hundred MB.)

Thanks!

1 Answer

Look into notebooks. You can set them up in the AWS Glue console. They give you an interactive way of writing and testing your code before you put it into a Glue job script. For standard cases there is no big difference between SageMaker (Jupyter) and Zeppelin notebooks; I guess it comes down to taste.

In general, especially with small datasets, a local development environment might work for you as well and gives you even more freedom. For larger datasets, a common practice is to take a sample of only a few hundred records so it can be processed almost instantly. That helps a lot during development.
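The sampling idea can be sketched roughly as follows. This is not Glue code: it assumes the table has been exported to CSV, and the column names, the sample size of 300, and the helper names take_sample and drop_incomplete_rows are all made up for illustration. It uses only the Python standard library, so the cleanup rule can be tried on a handful of rows in well under a second.

```python
import csv
import io
from itertools import islice


def take_sample(lines, n=300):
    """Parse CSV text lines and return at most the first n rows as dicts."""
    return list(islice(csv.DictReader(lines), n))


def drop_incomplete_rows(rows, null_markers=("", "NULL")):
    """Keep only rows in which every field is a non-empty, non-NULL string."""
    return [
        row for row in rows
        if all(isinstance(value, str) and value.strip() not in null_markers
               for value in row.values())
    ]


# Hypothetical export; in practice this could be open("export.csv", newline="").
export = io.StringIO(
    "id,name,amount\n"
    "1,alice,10.5\n"
    "2,,7.0\n"
    "3,carol,NULL\n"
    "4,dave,3.2\n"
)

sample = take_sample(export, n=300)
cleaned = drop_incomplete_rows(sample)
print(f"{len(sample)} sampled, {len(cleaned)} kept")
for row in cleaned:
    print(row)
```

Once the rule behaves as expected on the sample, the same logic can be ported to the Glue script and run once against the full data.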

And last: I'm not sure why you want to move away from Postgres. What kind of analysis do you want to do that you can't do in the relational world? Also, why not do the clean-up in the DB?
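To illustrate what clean-up in the DB could look like, here is a minimal sketch that puts the NULL filtering into a view. It uses an in-memory SQLite database purely so the example runs without a server; the table orders, its columns, and the view name orders_clean are hypothetical. The CREATE VIEW statement is plain SQL, and on Postgres it would be sent through a Postgres driver instead.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres instance (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.5), (2, None, 7.0), (3, "carol", None), (4, "dave", 3.2)],
)

# The cleanup lives in a view, so the raw table stays untouched.
conn.execute(
    """
    CREATE VIEW orders_clean AS
    SELECT id, customer, amount
    FROM orders
    WHERE customer IS NOT NULL AND amount IS NOT NULL
    """
)

clean_rows = conn.execute(
    "SELECT id, customer, amount FROM orders_clean ORDER BY id"
).fetchall()
print(clean_rows)
```

A reporting tool would then read from the view rather than the raw table, and changes to the cleanup rule can be checked immediately with a SELECT.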