
Redshift CPU utilisation is 100 percent most of the time

Original post 2022-03-23 21:23:47 9 1 amazon-web-services / amazon-redshift

I have a 96 vCPU Redshift cluster (8 ra3.4xlarge nodes), and most of the time the CPU utilisation is 100 percent. It was a 3 node dc2.large cluster before, and that was also always at 100 percent, which is why we moved to ra3. We are doing most of our compute on Redshift, but the data is not that large. I read somewhere that it doesn't matter how much compute you add: unless the increase is significant, there will only be a slight improvement in computation. Can anyone explain this?

1 Answer

I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes you will be CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is the constraint. If you are using all these resources above 50% capacity, you are getting what you paid for. Once any of them gets to 100% there is a significant drop-off in performance, as working around the oversubscribed resource steals performance.
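To make that rule of thumb concrete, here is a minimal Python sketch that labels each resource using the thresholds mentioned above (above 50% is good value, 100% is oversubscribed). The function name, the labels, and the sample numbers are all hypothetical and only for illustration; real figures would come from your own cluster monitoring.

```python
def classify(utilization):
    """Label each resource given its average utilization in percent (0-100)."""
    report = {}
    for resource, pct in utilization.items():
        if pct >= 100:
            report[resource] = "saturated"    # oversubscribed: performance drops off
        elif pct > 50:
            report[resource] = "well used"    # getting what you paid for
        else:
            report[resource] = "underused"    # paid for, but mostly idle
    return report


# Hypothetical averages for a CPU-bound cluster (made-up numbers).
sample = {"cpu": 100, "memory": 35, "disk_io": 20, "network": 15}
report = classify(sample)
for resource, label in report.items():
    print(f"{resource}: {label}")
```

A report with one "saturated" entry and everything else "underused" is the lopsided pattern described in this answer.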

Now you are in a situation where you are seeing 100% CPU for a significant portion of the operating time, right? This means you have paid for all these other resources but are not using them, AND you are incurring inefficiencies from working around the saturated CPU (though of all the factors, high CPU has the least overhead). The big question is why.

There are a few possibilities, but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT: problem solved. But this still creates all the duplicates and only then reduces the set down. All the work is done and most of the results are thrown away. However, if they pared down the factors in each table first and then cross joined them, the total work would be significantly lower. A query like the first one will do exactly what you are seeing: high CPU as it spins making repeat combinations and then throws most of them away.
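Here is a small sketch of that pattern. It uses Python's built-in sqlite3 module rather than Redshift, purely so it can run anywhere; the table names, column names, and row counts are made up for illustration. It compares how many rows the join has to produce in each version and checks that both give the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: many rows each, but only a few distinct factor values.
cur.execute("CREATE TABLE orders (region TEXT)")
cur.execute("CREATE TABLE shipments (carrier TEXT)")
cur.executemany("INSERT INTO orders VALUES (?)",
                [(f"region_{i % 4}",) for i in range(500)])
cur.executemany("INSERT INTO shipments VALUES (?)",
                [(f"carrier_{i % 3}",) for i in range(500)])

# "Fat in the middle": the cross join produces 500 * 500 rows,
# and DISTINCT then discards almost all of them.
fat_rows = cur.execute(
    "SELECT COUNT(*) FROM orders CROSS JOIN shipments"
).fetchone()[0]
fat_result = cur.execute(
    "SELECT DISTINCT region, carrier "
    "FROM orders CROSS JOIN shipments "
    "ORDER BY region, carrier"
).fetchall()

# Pared down first: reduce each table to its distinct factors, then cross join.
lean_join = (
    "FROM (SELECT DISTINCT region FROM orders) AS o "
    "CROSS JOIN (SELECT DISTINCT carrier FROM shipments) AS s"
)
lean_rows = cur.execute("SELECT COUNT(*) " + lean_join).fetchone()[0]
lean_result = cur.execute(
    "SELECT region, carrier " + lean_join + " ORDER BY region, carrier"
).fetchall()

print("rows produced by the join, fat version: ", fat_rows)
print("rows produced by the join, lean version:", lean_rows)
print("same final result:", fat_result == lean_result)

conn.close()
```

Both versions return the same combinations; the difference is only in how much intermediate data the join has to build before the final set is known.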

If you have many of this type of "fat in the middle" query, where lots of extra data is made and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are also buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, which spins more tires.

Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that it's powerful so it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or... If you have the wrong tool for the job it may not work well. One 6 ton dump truck isn't the same thing as a team of 50 delivery motorcycles - each has its purpose but they aren't interchangeable.

Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but this is ok: you are getting the job done at an appropriate cost. In this case 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is ok.