简体   繁体   中英

When is data consistency not an issue?

I am new in learning distributed systems and I read about the CAP theorem, I am interested in an AP system such as Cassandra.

My question is in what cases can you actually sacrifice consistency? Effectively what I am saying is sacrificing consistency means serving inaccurate data. In what cases would then you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.

By AP system, I assume you will at least target to ensure eventual consistency.

Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed has occasional five minutes lag (his feed list has eventual consistency). Missing 2/3 very recent updates in the news feed is okay in this scenario as long as those feeds will eventually appear. And in fact, Facebook built it's news feed using Cassandra.

Imagine a distributed key-value store cache system where update is very rare. If there is almost no update operations, ensuring strong consistency is un-necessary, so you can focus on availability. Occasional cache miss (the key-value entry is not populated yet) and request to database due to eventual consistency should be okay.

My question is in what cases can you actually sacrifice consistency?

One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.

For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.

Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?

tl;dr; The real question to ask; is whether or not getting a somewhat-accurate list of 5 product recommendations in less than 10ms, is better than getting a 100% accurate list of 5 product recommendations in 100ms?

Both result sets will help drive sales. But the one which is returned fast enough that it doesn't hinder the user experience is much more preferred.

'C' in CAP refers to linearizability which is a very strong form of consistancy that you don't need most of the time.

Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.

  1. Leader election
  2. Allowing end users to create their unique user id
  3. Distributed locking etc.

When you have these usecases, you'd use something like ZooKeeper, etcd etc. Cassandra also has Light Weight Transaction (LWT) which uses an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. And in vast majority of cases you are just fine with a little weaker consistency to get better scalability and performance. You trade a little bit of consistency with scalability and performance.

Some eCommerce websites send apology letter to customers for not being able to fulfill their orders. That is because the last copy of the product has been sold to more than one customers due to lack and linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to their requests within stringent SLAs.

Cassandra is said to have a tuneable consistency. You may want to record user clicks or activities for analysis. You are okay if some data are lost, but you cannot compromise with the performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).

If you want a little more consistency, you'd use a QUORUM consistency level to read and write along with hints and read repair. In vast majority of case all nodes are updated instantaneously. Even if one or two nodes go down, a majority of nodes will have the data and failed nodes would be repaired when they come back using hints, read repair, anti entropy repair.

Cassandra is particularly useful for cases where you'd not have many concurrent updates on same data. The reason is, unlike the dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamp. If timestamps are same, it uses lexicographical order. Since the time on nodes cannot be accurate even in the presence of NTPD, there is a possibility of data loss, although Cassandra has taken some steps to avoid that - for eg client side timestamp instead of server side timestamp.

The CAP theorem says that given partition tolerence, you can either choose availability or consistency in a distributed database (no one would want to give up partition tolerence in any case). So if you want to have maximum availability, you'll have to give up on the consistency. This depends of course, on how critical the business is.

You answered something on SO but the answer doesn't show up when you visit the page? Can be tolerated. SO being down? Can't be. Critical financial systems would rather have strong consistency than availability. Every once-in-a-while, my bank's servers would go offline when I try to make a payment.

Normally, you choose availability and eventual consistency. The answer you wrote into SO would eventually show up.

Apart from the above mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to solve the inconsistency.

For example, if we found two different versions of someone's address in the database, we can prompt the user to identity the correct address.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM