
Tuning write performance in Cassandra

We have this typical scenario:

One column family with fewer than 10 simple columns.

When we get a request from the client, we need to write 10,000,000 records of this column family to the database, and we write them in batches (1,000 per batch). This usually takes 5-10 minutes, depending on the number of nodes in the cluster and the replication factor.
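For concreteness, this is roughly how such a batched load looks with the DataStax Java driver (a sketch only; the keyspace, table, and column names are made-up placeholders):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class BatchLoad {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");      // hypothetical keyspace
            PreparedStatement insert = session.prepare(
                    "INSERT INTO records (id, payload) VALUES (?, ?)");  // hypothetical table

            BatchStatement batch = new BatchStatement();           // logged batch by default
            for (long i = 0; i < 10_000_000L; i++) {
                batch.add(insert.bind(i, "value-" + i));
                if (batch.size() == 1000) {                        // 1,000 records per batch
                    session.execute(batch);
                    batch.clear();
                }
            }
            if (batch.size() > 0) {
                session.execute(batch);                            // flush the remainder
            }
            cluster.close();
        }
    }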

In the next few hours after the initial writes, we will receive lots of updates (each record is updated twice).

So we have lots of writes/updates in one period of the day (about an hour) and very little after that.

The question is: what steps can we take to improve write/update performance? I have noticed configuration fields such as memtable_flush_queue_size, but I don't have enough experience with Cassandra to know exactly what to do.

Any suggestion is helpful,

Ivan

  1. Increase the JVM heap (max 12 GB on Java 6+) - this will automatically increase the size of the memtables and make flushes less frequent. It also means that frequent updates will be merged together in RAM rather than during compaction, which reduces disk usage as well. As always there is a disadvantage: Cassandra will need more time to start, because the commit log will grow larger (it is removed when a memtable is flushed to an SSTable). See the configuration sketch after this list.
  2. VERY IMPORTANT: use separate disks for the data directory and for the commit log. You could use an SSD for data; it makes no sense for the commit log, because commit-log writes are sequential.
  3. Changing the replication factor to 1 will generate less load in the cluster, because each node only has to take care of its own data and not additional replicas, but you might lose data - I would not recommend it.
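A minimal sketch of where these knobs live, assuming a Cassandra 1.x-style installation (the paths and values below are illustrative assumptions, not recommendations):

    # conf/cassandra-env.sh - JVM heap size (values are assumptions for illustration)
    MAX_HEAP_SIZE="12G"
    HEAP_NEWSIZE="1200M"

    # conf/cassandra.yaml - keep data and commit log on separate physical disks
    data_file_directories:
        - /disk1/cassandra/data                       # could be an SSD
    commitlog_directory: /disk2/cassandra/commitlog   # sequential writes; a spinning disk is fine

    # conf/cassandra.yaml - the memtable settings the question mentions
    memtable_flush_queue_size: 4
    memtable_total_space_in_mb: 4096                  # defaults to a fraction of the heap if unset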

These posts might help you get a better understanding:

http://maciej-miklas.blogspot.de/2012/09/cassanrda-tuning-for-frequent-column.html

http://maciej-miklas.blogspot.de/2012/08/cassandra-11-reading-and-writing-from.html

In addition to Maciej's good points, I would add at a higher level that using batches to bulk-load normal writes is an antipattern. Its main effect is to make your workload more "bursty", which is bad. Use batches only when you have writes that need to be done together for consistency.
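To make the contrast concrete, here is a sketch of the same load done as individual asynchronous writes with the DataStax Java driver, throttled by a simple in-flight window (names reuse the hypothetical schema from the sketch above):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class AsyncLoad {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");      // hypothetical keyspace
            PreparedStatement insert = session.prepare(
                    "INSERT INTO records (id, payload) VALUES (?, ?)");  // hypothetical table

            List<ResultSetFuture> inFlight = new ArrayList<>();
            for (long i = 0; i < 10_000_000L; i++) {
                inFlight.add(session.executeAsync(insert.bind(i, "value-" + i)));
                if (inFlight.size() == 1000) {                     // cap concurrent requests
                    for (ResultSetFuture f : inFlight) {
                        f.getUninterruptibly();                    // drain the window
                    }
                    inFlight.clear();
                }
            }
            for (ResultSetFuture f : inFlight) {
                f.getUninterruptibly();
            }
            cluster.close();
        }
    }

Each statement is routed independently by the driver, so the load spreads across the cluster instead of arriving in bursts.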

For bulk load, consider batching them at the source and using sstableloader, but I wouldn't recommend investing that effort until you reach the ~100M-row level.
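For reference, sstableloader is pointed at a directory whose last two path components are the keyspace and table; a hedged example (hosts and paths are assumptions):

    # stream pre-built SSTables in /tmp/load/my_keyspace/records into the cluster
    sstableloader -d 10.0.0.1,10.0.0.2 /tmp/load/my_keyspace/records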

Cassandra is a log-structured database, so it behaves the same whether it is an update or a new write. If consistency is not critical, you can set the write consistency level to ONE; that should help a bit. Also, which client are you using, CQL or Thrift? If you are using Thrift, it is synchronous, which means each client thread is blocked on one request at a time - use more client threads.
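With the DataStax Java driver, for instance, the consistency level can be set per statement (a sketch; the table is the same hypothetical one as above):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class WriteAtOne {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");   // hypothetical keyspace
            SimpleStatement stmt = new SimpleStatement(
                    "INSERT INTO records (id, payload) VALUES (1, 'v')");  // hypothetical table
            stmt.setConsistencyLevel(ConsistencyLevel.ONE);     // ack from a single replica is enough
            session.execute(stmt);
            cluster.close();
        }
    }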

Do you actually require batching? Are the updates dependent on the previous row state? If not, I would not recommend batching: the batch request goes to one node, and that coordinator node has to do extra work to forward the statements to other nodes based on their partition keys. Batching is useful when you know that the whole batch targets a single partition key. If you separate out each request, the load will be distributed more evenly and the write throughput will also increase. Please check the link below if you want to understand batching in more detail: https://lostechies.com/ryansvihla/2014/08/28/cassandra-batch-loading-without-the-batch-keyword/
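A sketch of the one case where batching does pay off - every statement in the batch shares the same partition key, so the whole batch lands on a single replica set (the events table and its key layout are hypothetical):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class SinglePartitionBatch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");   // hypothetical keyspace
            // hypothetical table: PRIMARY KEY (user_id, event_id) - user_id is the partition key
            PreparedStatement insert = session.prepare(
                    "INSERT INTO events (user_id, event_id, payload) VALUES (?, ?, ?)");

            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            long userId = 42L;                                  // one partition key for the whole batch
            for (long event = 0; event < 10; event++) {
                batch.add(insert.bind(userId, event, "payload-" + event));
            }
            session.execute(batch);                             // no cross-partition fan-out
            cluster.close();
        }
    }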
