
When is data consistency not an issue?

I am new to distributed systems and have been reading about the CAP theorem. I am interested in AP systems such as Cassandra.

My question is: in what cases can you actually sacrifice consistency? As I understand it, sacrificing consistency means serving inaccurate data. In what cases would you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.

By AP system, I assume you will at least aim for eventual consistency.

Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed occasionally lags by five minutes (the feed is eventually consistent). Missing the two or three most recent updates in the news feed is okay in this scenario, as long as those updates eventually appear. In fact, Cassandra was originally developed at Facebook, where it powered inbox search.

Imagine a distributed key-value cache where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. An occasional cache miss (the key-value entry is not populated yet on that replica), followed by a request to the database, is an acceptable cost of eventual consistency.
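The cache scenario above can be sketched roughly as follows. This is a minimal, single-process illustration and not how any particular cache product works; the class name ReadThroughCache and the load_from_db callback are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Toy cache replica whose contents may lag behind the database."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._entries: Dict[str, str] = {}
        self._load_from_db = load_from_db
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._entries:
            return self._entries[key]
        # The entry has not reached this replica yet: fall back to the database.
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            self._entries[key] = value
        return value


database = {"user:1": "alice"}          # stand-in for the real database
cache = ReadThroughCache(database.get)
first = cache.get("user:1")             # miss, served from the database
second = cache.get("user:1")            # hit, served from the cache
```

The read still succeeds on a miss; the only cost of the missing entry is one extra trip to the database.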

My question is in what cases can you actually sacrifice consistency?

One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially an aggregation over many, many users to determine purchasing/viewing patterns.

For example: if I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for the purchasing patterns of others who have also purchased an action figure of Rey. The query returns the top 5 product results and puts them at the bottom of the page.

Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?

tl;dr: The real question to ask is whether getting a somewhat accurate list of 5 product recommendations in less than 10ms is better than getting a 100% accurate list of 5 product recommendations in 100ms.

Both result sets will help drive sales. But the one that is returned fast enough not to hinder the user experience is strongly preferred.

The 'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.

Linearizability is a recency guarantee that makes it appear as if there is a single copy of the data. As soon as a change to the data completes, all subsequent reads return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, namely:

  1. Leader election
  2. Allowing end users to create their unique user ID
  3. Distributed locking, etc.

When you have these use cases, you'd use something like ZooKeeper or etcd. Cassandra also has lightweight transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can address those rare use cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are fine with slightly weaker consistency in exchange for better scalability and performance. You trade a little bit of consistency for scalability and performance.

Some eCommerce websites send apology letters to customers for not being able to fulfill their orders. That happens because the last unit of a product was sold to more than one customer due to the lack of linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to requests within stringent SLAs.
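To make the overselling case concrete, here is a small sketch under simplifying assumptions: two in-memory "replicas" that cannot see each other's sales, compared with a single copy whose check-and-decrement is atomic. The class names ReplicaStock and LinearizableStock are made up for illustration and do not correspond to any real datastore API.

```python
import threading


class LinearizableStock:
    """Single logical copy: the check and the decrement happen atomically."""

    def __init__(self, units: int) -> None:
        self._units = units
        self._lock = threading.Lock()

    def try_sell(self) -> bool:
        with self._lock:
            if self._units == 0:
                return False
            self._units -= 1
            return True


class ReplicaStock:
    """One replica's local, possibly stale, view of the stock."""

    def __init__(self, units: int) -> None:
        self.units = units
        self.sold = 0

    def try_sell(self) -> bool:
        if self.units - self.sold <= 0:
            return False
        self.sold += 1
        return True


# Both replicas believe one unit is left and have not yet exchanged updates.
replica_a, replica_b = ReplicaStock(1), ReplicaStock(1)
accepted = [replica_a.try_sell(), replica_b.try_sell()]
oversold = replica_a.sold + replica_b.sold - 1  # units promised beyond real stock

strict = LinearizableStock(1)
strict_results = [strict.try_sell(), strict.try_sell()]
```

Both replicas accept the order, so one customer ends up with an apology letter. The linearizable version rejects the second order, but every sale has to go through the same coordination point.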

Cassandra is said to have tunable consistency. You may want to record user clicks or activities for analysis. You are okay with losing some data, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).

If you want a little more consistency, you'd use the QUORUM consistency level for both reads and writes, along with hints and read repair. In the vast majority of cases all replicas are updated almost immediately. Even if one or two nodes go down, a majority of replicas will still have the data, and the failed nodes are repaired when they come back through hints, read repair, and anti-entropy repair.
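The reason QUORUM reads and writes give "a little more consistency" is that any read quorum shares at least one replica with any write quorum. The sketch below checks that by brute force. It is a simplified model that ignores hinted handoff, failed or partial writes, and concurrent writers; the function names are hypothetical.

```python
from itertools import combinations


def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas intersects every set of r read replicas."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


rf = 3
q = quorum(rf)
quorum_overlaps = always_overlap(rf, q, q)   # QUORUM writes, QUORUM reads
single_overlaps = always_overlap(rf, 1, 1)   # one replica written, one replica read
```

With a replication factor of 3, writing to 2 replicas and reading from 2 always overlaps, while writing to 1 and reading from 1 does not. This is the usual R + W > N rule.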

Cassandra is particularly useful when you don't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamps. If the timestamps are equal, it falls back to lexicographical order. Since clocks on different nodes can never be perfectly synchronized, even with NTPD running, there is a possibility of data loss, although Cassandra has taken some steps to reduce it, for example using client-side timestamps instead of server-side timestamps.
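A rough sketch of Last Write Wins and of how clock skew can silently drop an update. This is an illustration of the idea only, not Cassandra's actual code; the Cell type, the lww_merge function, and the timestamps are invented for the example.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    timestamp: int  # write timestamp in microseconds, as reported by the writer
    value: str


def lww_merge(a: Cell, b: Cell) -> Cell:
    """Keep the cell with the higher timestamp; on a tie, the lexicographically larger value."""
    # Tuples compare field by field: timestamp first, then value.
    return max(a, b)


# The first write happens earlier in real time, but its writer's clock runs fast.
earlier_write = Cell(timestamp=1_000_500, value="old address")
# The second write happens later in real time, but its writer's clock runs slow.
later_write = Cell(timestamp=1_000_200, value="new address")

winner = lww_merge(earlier_write, later_write)
tie_winner = lww_merge(Cell(5, "a"), Cell(5, "b"))
```

Here winner holds "old address": the update that really happened last is discarded because its timestamp is smaller, which is the data-loss risk described above.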

The CAP theorem says that, given partition tolerance, you must choose between availability and consistency in a distributed database (nobody would want to give up partition tolerance in any case). So if you want maximum availability, you have to give up some consistency. How much depends, of course, on how critical the business is.

You answered something on SO but the answer doesn't show up when you visit the page? Can be tolerated. SO being down? Can't be. Critical financial systems would rather have strong consistency than availability. Every once in a while, my bank's servers go offline when I try to make a payment.

Normally, you choose availability and eventual consistency. The answer you wrote on SO will eventually show up.

Apart from the cases mentioned above, where inconsistent data is tolerable, there are also scenarios where we can defer to the user to resolve the inconsistency.

For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct address.
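One way to picture this is to keep every conflicting version and only ask the user when the replicas disagree. The resolve_address function below is a hypothetical helper written for this example, not part of any database API.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_address(versions: Sequence[str]) -> Tuple[Optional[str], List[str]]:
    """Return (address, []) if all replicas agree, else (None, candidates to show the user)."""
    distinct = sorted(set(versions))
    if len(distinct) == 1:
        return distinct[0], []
    return None, distinct


agreed = resolve_address(["1 Main St", "1 Main St"])
conflict = resolve_address(["1 Main St", "2 Oak Ave", "1 Main St"])
```

When the second element is non-empty, the application would show those candidates to the user and write back whichever one they pick.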

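Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original source address. For any questions, contact: yoyou2525@163.com.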

 