
AWS Glue pricing against AWS EMR

I am comparing the pricing of AWS Glue and AWS EMR in order to choose between them.

I have considered 6 DPUs (4 vCPUs + 16 GB memory each) with an ETL job running for 10 minutes a day for 30 days. Crawler requests are assumed to be 1 million above the free tier, calculated at $1 for the additional 1 million requests.

On EMR I have considered m3.xlarge (EC2 and EMR charges of $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes a day for 30 days.

Calculating for a month, AWS Glue works out to around $14.64, whereas EMR works out to around $10.08. I have not taken into account other expenses such as S3, RDS, Redshift, etc., or the Dev Endpoint, which is optional, since my objective is to compare ETL job costs.
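As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes the $0.44 per DPU-hour Glue rate quoted elsewhere in this thread, the m3.xlarge rates given in the question, and no minimum billing duration or rounding. The function names are illustrative only. Under these assumptions the result is about $14.20 for Glue and $10.08 for EMR. The small gap from the $14.64 quoted in the question is not explained by the figures given here.

```python
# Hourly rates in USD as quoted in this thread; real AWS prices vary by region and over time.
GLUE_RATE_PER_DPU_HOUR = 0.44   # assumption: Glue job rate mentioned in the answers
EC2_M3_XLARGE_PER_HOUR = 0.266  # EC2 charge from the question
EMR_M3_XLARGE_PER_HOUR = 0.070  # EMR surcharge from the question


def glue_monthly_cost(dpus, minutes_per_day, days, crawler_extra=0.0):
    """Glue job cost plus a flat crawler charge (no billing minimums modeled)."""
    hours = minutes_per_day / 60 * days
    return dpus * GLUE_RATE_PER_DPU_HOUR * hours + crawler_extra


def emr_monthly_cost(nodes, minutes_per_day, days):
    """EMR cost: each node pays the EC2 price plus the EMR surcharge."""
    hours = minutes_per_day / 60 * days
    return nodes * (EC2_M3_XLARGE_PER_HOUR + EMR_M3_XLARGE_PER_HOUR) * hours


glue = glue_monthly_cost(dpus=6, minutes_per_day=10, days=30, crawler_extra=1.00)
emr = emr_monthly_cost(nodes=6, minutes_per_day=10, days=30)
print(f"Glue: ${glue:.2f}  EMR: ${emr:.2f}")
```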

It looks like EMR is cheaper than AWS Glue. Is the EMR pricing correct? Can someone please point out anything I am missing? I have tried the AWS price calculator for EMR, but I am confused, and it is not clear whether normalized hours are billed into it.

Regards

Yuva

Yes, EMR does work out to be cheaper than Glue. This is because Glue is meant to be serverless and fully managed by AWS, so the user doesn't have to worry about the infrastructure running behind the scenes, whereas EMR requires a lot of configuration to set up. So it's a trade-off between user friendliness and cost, and for more technical users EMR can be the better option.

@user2889316 - Did you check my question, in which I provided comparison numbers?

Also, please note that Glue costs roughly $0.44 per DPU-hour for a job. I don't think you will have any AWS Glue job that is expected to run throughout the day? Are you talking about the Glue Dev Endpoint or the job?

An AWS Glue job requires a minimum of 2 DPUs to run, which means $0.88 per hour, or roughly $21 per day. This is only for the Glue job; there are additional charges such as S3 and any database, connection, or crawler charges.

The corresponding EMR instance is m3.xlarge, with EC2 and EMR charges of $0.266 and $0.070 per hour respectively. That comes to roughly $16 per day for 2 instances, plus other S3 and database charges. I am comparing 2 EMR instances against the default DPUs for an AWS Glue job.
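The per-day figures above can be reproduced with a small Python sketch. It assumes a job that runs continuously for 24 hours, uses the rates quoted in this thread, and ignores S3, database, and crawler charges. The helper name `daily_cost` is purely illustrative.

```python
# Hourly rates in USD as quoted in this thread; treat them as assumptions, not current prices.
GLUE_RATE_PER_DPU_HOUR = 0.44
EMR_NODE_RATE_PER_HOUR = 0.266 + 0.070  # m3.xlarge EC2 price plus EMR surcharge


def daily_cost(units, rate_per_unit_hour, hours=24):
    """Cost of running `units` DPUs or nodes for `hours` at a flat hourly rate."""
    return units * rate_per_unit_hour * hours


glue_per_day = daily_cost(2, GLUE_RATE_PER_DPU_HOUR)  # minimum of 2 DPUs
emr_per_day = daily_cost(2, EMR_NODE_RATE_PER_HOUR)   # 2 m3.xlarge nodes
print(f"Glue: ${glue_per_day:.2f}/day  EMR: ${emr_per_day:.2f}/day")
```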

Hope this gives you an idea.

Thanks

If you use Spot instances for EMR instead of On-Demand, it will cost about 1/3rd of the On-Demand price and turn out to be much cheaper. AWS Glue doesn't offer that pricing benefit.
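To put a rough number on this, here is a hedged Python sketch. It assumes the Spot discount applies only to the EC2 part of the price and that the EMR surcharge stays the same. The 1/3 ratio comes from the claim above, and real Spot prices fluctuate. The rates are the m3.xlarge figures quoted in the question.

```python
# Assumed hourly rates in USD for m3.xlarge, as quoted in the question.
EC2_ON_DEMAND_PER_HOUR = 0.266
EMR_FEE_PER_HOUR = 0.070
SPOT_FRACTION = 1 / 3  # assumption taken from the answer; actual Spot prices vary


def node_hourly_cost(use_spot):
    """Hourly cost of one EMR node; only the EC2 component is discounted."""
    ec2 = EC2_ON_DEMAND_PER_HOUR * SPOT_FRACTION if use_spot else EC2_ON_DEMAND_PER_HOUR
    return ec2 + EMR_FEE_PER_HOUR


on_demand = node_hourly_cost(use_spot=False)
spot = node_hourly_cost(use_spot=True)
print(f"On-Demand: ${on_demand:.3f}/h  Spot: ${spot:.3f}/h per node")
```

Under this assumption the total per-node cost drops to a little under half of the On-Demand total, not a third, because the EMR fee is not discounted.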

If your infrastructure doesn't need drastic scaling (and mostly has a fixed configuration), use EMR. If it does, Glue is the better choice because it is serverless: just by changing the number of DPUs, your infrastructure is scaled.

In EMR, however, you have to decide on the cluster type, the number of nodes, and the auto-scaling rules. For each change, you need to modify the cluster creation script, test it, and deploy it, which basically adds the overhead of a standard release cycle. When the infra config changes, you may also want to change the Spark config to optimize jobs accordingly, so releasing a new version takes longer. If you start with a high configuration, it will cost more. If you start with a low configuration, you will need frequent changes to the script.

Having said that, AWS Glue has a fixed infra configuration for each DPU, e.g. 16 GB of memory. If your ETL demands more memory than that, you may have to shift to EMR. However, if your ETL is designed so that it will not exceed 11 GB of driver memory with 1 executor, or 5.5 GB with 2 executors (e.g. take additional data volume in parallel on a new core, or divide the volume into 5 GB/11 GB batches and run them in a for loop on the same core), Glue is the right choice.

If your ETL is complex and the jobs are going to keep the cluster busy throughout the day, I would recommend going with EMR and a dedicated DevOps team to manage the EMR infra.
