
Is this calculation correct? (rook replication)

If 1 OSD crashes, does rook-ceph eventually try to replicate the missing data onto the still-working OSDs, or does it wait for all OSDs to be healthy again? Let's say yes, so that I can explain how I calculated:

I started with 1.71 TB provisioned for Kubernetes PVCs and 3 nodes of 745 GB each (2.23 TB total). Rook has a replication factor of 2 (RF=2).

For the replication to work, I need 2 × 1.71 TB (3.42 TB), so I added 2 nodes of 745 GB each (3.72 TB total). Let's say I use all of the 1.71 TB provisioned.

If I lose an OSD, my K8S cluster still runs because the data is replicated, but when the missing data is re-replicated onto the still-working OSDs, other OSDs may crash, because, assuming OSDs are always filled equally (which I know is not true in the long run; the numbers are checked in the sketch after this list):

  • I have 290 GB of unused space on my cluster (3.72 TB total - 3.42 TB of PVC provisioning)
  • Which is 58 GB per OSD (290 / 5)
  • The crashed OSD held 687 GB (745 GB disk total - 58 GB unused)
  • Ceph tries to replicate 172 GB of missing data onto each remaining OSD (687 / 4)
  • Which is way too much, because we only have 58 GB left per OSD, which should lead to cascading OSD failures
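These numbers can be sanity-checked with a small throwaway calculation. This is only a sketch of the "perfectly even distribution" assumption above; the helper name is mine, and the figures differ slightly from the bullets because it uses 1710 GB and 745 GB exactly:

```python
def rebalance_after_one_osd_loss(num_osds, osd_size_gb, raw_used_gb):
    """Toy model: data is spread perfectly evenly across OSDs and
    raw_used_gb already includes the replication factor (PVC usage * RF)."""
    free_per_osd = (num_osds * osd_size_gb - raw_used_gb) / num_osds
    data_on_crashed_osd = osd_size_gb - free_per_osd
    extra_per_survivor = data_on_crashed_osd / (num_osds - 1)
    return free_per_osd, extra_per_survivor

free, extra = rebalance_after_one_osd_loss(5, 745, 2 * 1710)
print(f"free per OSD: {free:.0f} GB, backfill per survivor: {extra:.0f} GB")
# free per OSD: 61 GB, backfill per survivor: 171 GB -> 171 > 61, the data cannot fit
```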

If I had 6 nodes instead of 5, I could lose 1 OSD indefinitely, though (see the sketch after this list):

  • The new pool is 4.5 TB (6 × 745 GB)
  • I have 1+ TB of free space on the cluster (4.5 TB total - 3.42 TB of PVC provisioning)
  • Which is 166+ GB per OSD (~1 TB / 6)
  • The crashed OSD held at most 579 GB of data (745 - 166)
  • Ceph tries to replicate about 116 GB of missing data onto each remaining OSD (579 / 5)
  • Which is less than the free space on each OSD (166+ GB), so replication works again with only 5 nodes left, but if another OSD crashes I'm doomed.
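The same toy calculation for 6 OSDs (dividing the crashed OSD's data over the 5 survivors), inlined here so it stands on its own:

```python
# Same toy model with 6 OSDs of 745 GB and 3420 GB of raw usage (2 * 1710 GB).
total_gb = 6 * 745                               # 4470 GB raw capacity
free_per_osd = (total_gb - 2 * 1710) / 6         # 175 GB headroom per OSD
extra_per_survivor = (745 - free_per_osd) / 5    # 114 GB of backfill per surviving OSD
print(free_per_osd, extra_per_survivor)          # 175.0 114.0 -> the backfill fits
```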

Is the initial assumption correct? If so, does the maths sound right to you?

First: if you value your data, don't use replication with size 2! You will eventually run into issues leading to data loss.

Regarding your calculation: Ceph doesn't distribute every MB of data evenly across all nodes; there will be differences between your OSDs. Because of that, the OSD with the most data will be your bottleneck with respect to free space and the capacity to rebalance after a failure. Ceph also doesn't handle full or near-full clusters very well, and your calculation is very close to a full cluster, which will lead to new issues. Try to avoid a cluster with more than 85 or 90% used capacity; plan ahead and use more disks to both avoid a full cluster and gain higher failure resiliency. The more OSDs you have, the less impact a single disk failure has on the rest of the cluster.
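To tie this back to the arithmetic above: Ceph flags OSDs at the nearfull, backfillfull and full ratios (the defaults are around 85%, 90% and 95%), and backfill towards an OSD stops once it reaches backfillfull. A hedged extension of the earlier toy model, checking post-failure utilisation against those thresholds:

```python
# Toy check: how full would the surviving OSDs be after rebalancing one lost OSD,
# compared with Ceph's default backfillfull (~0.90) and full (~0.95) ratios?
def post_failure_utilisation(num_osds, osd_size_gb, raw_used_gb):
    free_per_osd = (num_osds * osd_size_gb - raw_used_gb) / num_osds
    data_per_osd = osd_size_gb - free_per_osd
    backfill_per_survivor = data_per_osd / (num_osds - 1)
    return (data_per_osd + backfill_per_survivor) / osd_size_gb

for n in (5, 6):
    u = post_failure_utilisation(n, 745, 2 * 1710)
    print(f"{n} OSDs: survivors ~{u:.0%} full after rebalancing")
# 5 OSDs: survivors ~115% full -> impossible; backfill would stop at backfillfull long before
# 6 OSDs: survivors ~92% full  -> above backfillfull, so still far too tight in practice
```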

And regarding recovery: Ceph usually tries to recover automatically, but it depends on your actual CRUSH map and the rulesets your pools are configured with. For example, suppose you have a CRUSH tree consisting of 3 racks and your pool is configured with size 3 (so 3 replicas in total) spread across those 3 racks (failure domain = rack), and then a whole rack fails. In that case Ceph won't be able to recover the third replica until the rack is online again. The data is still available to clients, but your cluster is in a degraded state. This configuration has to be done manually, though, so it probably won't apply to you; I just wanted to point out how that works. The default is usually a pool with size 3 and host as the failure domain.
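A toy illustration of that failure-domain constraint (this is not CRUSH itself, just the "one replica per failure domain" rule it enforces): with failure domain = rack, size = 3 and one of three racks down, only two replicas can be placed, so the placement groups stay degraded until the rack returns.

```python
# Toy placement rule: at most one replica per failure domain (here: racks).
def placeable_replicas(pool_size, failure_domains, failed):
    usable = [fd for fd in failure_domains if fd not in failed]
    return min(pool_size, len(usable))

racks = ["rack1", "rack2", "rack3"]
print(placeable_replicas(3, racks, failed={"rack3"}))  # 2 -> degraded but still serving I/O
print(placeable_replicas(3, racks, failed=set()))      # 3 -> fully replicated again
```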
