
Throughput vs. replication factor: read performance of Cassandra

I have a cluster of 8 Cassandra nodes (Amazon EC2 instances). I'm evaluating the effect of increasing the replication factor on the read performance of Cassandra. No writes are performed except the initial insert of 1 million objects. Read repair chance is disabled and I'm using a consistency level of ONE. My observation so far is that read performance decreases as the replication factor increases. Is there an explanation for why this is happening?

Depending on the kind of read you are doing, read performance can decrease when you increase the replication factor while the number of nodes stays the same.

For example, if you run range queries on clustering columns, or any other query that requires the ALLOW FILTERING keyword, you can, in theory, observe this behaviour. Increasing the replication factor makes every node in the cluster store more data: the data for its primary range of the ring plus the data for all the partition keys for which it is a replica. Even though Cassandra has many optimizations to limit the performance degradation of such queries, more rows on each node means lower performance.
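To put rough numbers on "every node stores more data", here is a minimal sketch for the setup in the question (8 nodes, 1 million objects). The function name `rows_per_node` is made up for illustration, and it assumes the tokens are spread evenly across the ring, which a real cluster only approximates.

```python
def rows_per_node(total_rows: int, replication_factor: int, node_count: int) -> float:
    """Average number of row copies held by each node.

    Assumption: data is spread evenly over the ring, so every node
    holds the same share of the total_rows * replication_factor copies.
    """
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and the node count")
    return total_rows * replication_factor / node_count


# The cluster from the question: 8 nodes, 1 million objects.
for rf in range(1, 9):
    print(f"RF={rf}: about {rows_per_node(1_000_000, rf, 8):,.0f} rows per node")
```

Under that assumption the per-node data volume grows linearly with the replication factor, so a query that has to scan a node's local data has proportionally more rows to go through.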

For queries that use the partition key, the degradation should not be observable, since reaching the data takes almost the same number of accesses to the partition summary (in memory) and the partition index (on disk). This holds, of course, only for reads at consistency level ONE. If you do observe the slowdown in this case, I think it is related to an increased number of cache misses (if you use the key cache, the row cache or Bloom filters, especially when you try to read non-existent data): none of these structures can hold all the data that is present on disk, and since each node now holds more data, the hit rate of each of them should decrease. This can be verified using nodetool (for example, nodetool info reports the key-cache and row-cache hit rates).
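The cache argument can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of Cassandra's actual caches: each node gets a fixed-size LRU cache, keys are read uniformly at random, replicas are placed by a simplified "first node by modulo, then the next nodes on the ring" rule, and each read goes to one randomly chosen replica (consistency level ONE). All names and parameter values are invented for the example.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(replication_factor, node_count=8, key_count=1000,
                      cache_capacity=200, reads=20000, seed=42):
    """Return the overall cache hit rate of a toy cluster.

    Assumptions (not Cassandra internals): one LRU cache per node,
    uniform random reads, one randomly chosen replica per read.
    """
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(node_count)]
    hits = 0
    for _ in range(reads):
        key = rng.randrange(key_count)
        # Simplified placement: first replica by modulo, then walk the ring.
        first_replica = key % node_count
        node = (first_replica + rng.randrange(replication_factor)) % node_count
        cache = caches[node]
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / reads


for rf in (1, 3, 8):
    print(f"RF={rf}: hit rate {simulate_hit_rate(rf):.2f}")
```

With these made-up parameters, a node at replication factor 1 is responsible for fewer keys than its cache can hold, while at higher replication factors each node serves more keys than fit, so the simulated hit rate drops as the replication factor goes up.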

Of course, with partition-key access, increasing the replication factor has many other advantages, since more replica nodes are available to answer your queries. But since your driver has more replicas to choose from at higher replication factors, the probability of asking the same node for a given row twice decreases, and so does the probability of finding that row in a cache.
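As a back-of-the-envelope check of that last point: if we assume the driver picks one of the replicas uniformly at random for every read (real load-balancing policies may behave differently, for example round-robin among replicas), two reads of the same row land on the same node with probability 1/RF. The function names below are hypothetical.

```python
import random
from fractions import Fraction


def same_replica_probability(replication_factor: int) -> Fraction:
    """Exact probability that two independent reads of one row hit the
    same replica, assuming a uniformly random replica choice per read."""
    return Fraction(1, replication_factor)


def estimate_same_replica(replication_factor, trials=100_000, seed=7):
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    same = sum(
        rng.randrange(replication_factor) == rng.randrange(replication_factor)
        for _ in range(trials)
    )
    return same / trials


for rf in (1, 2, 3, 5, 8):
    print(f"RF={rf}: exact {same_replica_probability(rf)}, "
          f"simulated {estimate_same_replica(rf):.3f}")
```

So under this assumption the chance that a repeated read benefits from the cache entry left by the previous read shrinks in proportion to the replication factor.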

