We have the following typical scenario:
One column family with fewer than 10 simple columns.
When we get a request from a client, we need to write 10,000,000 records of this column family to the database, and we write them in batches (1,000 per batch). This usually takes 5-10 minutes, depending on the number of nodes in the cluster and the replication factor.
In the few hours after the writes start, we receive lots of updates (each record is updated 2 times).
So we have lots of writes/updates during one period of the day (one hour) and very little after that.
The question is: what steps can we take to improve write/update performance? I have noticed, for example, memtable_flush_queue_size and similar configuration fields, but I don't have enough experience with Cassandra to know exactly what to do.
Any suggestion is helpful,
Ivan
These might help you get a better understanding:
http://maciej-miklas.blogspot.de/2012/09/cassanrda-tuning-for-frequent-column.html
http://maciej-miklas.blogspot.de/2012/08/cassandra-11-reading-and-writing-from.html
In addition to Maciej's good points, I would add at a higher level that using batches to bulk load normal writes is an antipattern. Its main effect is to make your workload more "bursty" which is Bad. Use batches only when you have writes that need to be done together for consistency.
For bulk load, consider batching them at the source and using sstableloader, but I wouldn't recommend investing that effort until the ~100M row level.
Cassandra is a log-structured database, so it behaves the same whether a request is an update or a new write. If consistency is not very critical, you can use a write consistency level of ONE. That should help a bit. Also, which client are you using, CQL or Thrift? If you are using Thrift, it is synchronous, which means each client thread is blocked on one request at a time. Use more client threads.
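The "use more client threads" advice could look roughly like the sketch below. It is only an illustration under assumptions: `write_row` is a hypothetical stand-in for whatever blocking single-row write your client library exposes, and the default of 32 threads is an arbitrary starting point to tune, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor


def write_row(row):
    """Hypothetical placeholder for one blocking write made through your client."""
    # Replace this body with the real insert/update call of your client library.
    return row["id"]


def write_all(rows, writer=write_row, max_workers=32):
    """Issue one write per row from a pool of threads; return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map blocks until every write has finished and re-raises
        # the first exception raised by any writer call.
        return list(pool.map(writer, rows))


if __name__ == "__main__":
    rows = [{"id": i, "value": "v%d" % i} for i in range(1000)]
    written = write_all(rows)
    print(len(written))
```

Note that `pool.map` submits every row up front, so for millions of rows you would probably feed it in slices rather than pass the whole data set at once.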
Do you actually require batching? Do the updates depend on the previous row state? If not, I would not recommend batching: the whole batch request goes to one node, and that coordinator node has to do more work to forward the writes to the other nodes according to their partition keys. Batching is useful when you know that every write in the batch has the same partition key. If you send each request separately, the load is distributed more evenly across the cluster and write throughput increases. Please check the link below if you want to understand batching in more detail: https://lostechies.com/ryansvihla/2014/08/28/cassandra-batch-loading-without-the-batch-keyword/
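If you do decide to keep some batching, one way to respect the "one partition key per batch" rule is to group rows on the client before sending them. The following is a minimal sketch, not driver code: `key_of`, `plan_writes` and the chunk size of 100 are assumptions made for illustration, and actually sending each chunk is left to whatever client you use.

```python
from collections import defaultdict


def group_by_partition_key(rows, key_of):
    """Group rows so that every group holds rows of exactly one partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_of(row)].append(row)
    return dict(groups)


def plan_writes(rows, key_of, max_batch_size=100):
    """Yield (partition_key, chunk) pairs; each chunk touches a single partition."""
    for key, group in group_by_partition_key(rows, key_of).items():
        # Split large groups so that no single batch grows without bound.
        for start in range(0, len(group), max_batch_size):
            yield key, group[start:start + max_batch_size]
```

Each yielded chunk could then be sent as one batch, while rows whose partition key appears only once are better sent as plain individual writes.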