Cassandra Read/Write performance - High CPU

Question

I have started using Casandra since last few days and here is what I am trying to do.

I have about 2 Million+ objects which maintain profiles of users. I convert these objects to json, compress and store them in a blob column. The average compressed json size is about 10KB. This is how my table looks in cassandra,

Table:

dev.userprofile (uid varchar primary key, profile blob);

Select Query: select profile from dev.userprofile where uid='';

Update Query:

update dev.userprofile set profile='<bytebuffer>' where uid = '<uid>'

Every hour, I get events from a queue which I apply to my userprofile object. Each event corresponds to one userprofile object. I get about 1 Million of such events, so I have to update around 1M of the userprofile objects within a short time ie update the object in my application, compress the json and update the cassandra blob. I have to finish updating all of 1 Million user profile objects preferably in few mins. But I notice its taking longer now.

While running my application, I notice that I can update around 400 profiles/second on an average. I already see a lot of CPU iowait - 70%+ on cassandra instance. Also, the load initially is pretty high around 16 (on 8 vcpu instance) and then drops off to around 4.

What am I doing wrong? Because, when I was updating smaller objects of size 2KB I noticed that cassandra operations /sec is much faster. I was able to get about 3000 Ops/sec. Any thoughts on how I should improve the performance?

<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-core</artifactId>
  <version>3.1.0</version>
</dependency>
<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-extras</artifactId>
  <version>3.1.0</version>
</dependency>

I just have one node of cassandra setup within a m4.2xlarge aws instance for testing

Single node Cassandra instance
m4.2xlarge aws ec2
500 GB General Purpose (SSD) 
IOPS - 1500 / 10000

nodetool cfstats output

Keyspace: dev
    Read Count: 688795
    Read Latency: 27.280683695439137 ms.
    Write Count: 688780
    Write Latency: 0.010008401811899301 ms.
    Pending Flushes: 0
        Table: userprofile
        SSTable count: 9
        Space used (live): 32.16 GB
        Space used (total): 32.16 GB
        Space used by snapshots (total): 0 bytes
        Off heap memory used (total): 13.56 MB
        SSTable Compression Ratio: 0.9984539538554672
        Number of keys (estimate): 2215817
        Memtable cell count: 38686
        Memtable data size: 105.72 MB
        Memtable off heap memory used: 0 bytes
        Memtable switch count: 6
        Local read count: 688807
        Local read latency: 29.879 ms
        Local write count: 688790
        Local write latency: 0.012 ms
        Pending flushes: 0
        Bloom filter false positives: 47
        Bloom filter false ratio: 0.00003
        Bloom filter space used: 7.5 MB
        Bloom filter off heap memory used: 7.5 MB
        Index summary off heap memory used: 2.07 MB
        Compression metadata off heap memory used: 3.99 MB
        Compacted partition minimum bytes: 216 bytes
        Compacted partition maximum bytes: 370.14 KB
        Compacted partition mean bytes: 5.82 KB
        Average live cells per slice (last five minutes): 1.0
        Maximum live cells per slice (last five minutes): 1
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1

nodetool cfhistograms output

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             3.00              9.89           2816.16              4768                 2
75%             3.00             11.86          43388.63              8239                 2
95%             4.00             14.24         129557.75             14237                 2
98%             4.00             20.50         155469.30             17084                 2
99%             4.00             29.52         186563.16             20501                 2
Min             0.00              1.92             61.22               216                 2
Max             5.00          74975.55        4139110.98            379022                 2

Dstat output

---load-avg--- --io/total- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage---- -net/total-
 1m   5m  15m | read  writ|run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq| recv  send
12.8 13.9 10.6|1460  31.1 |1.0  14 0.2|9.98G  892k 21.2G  234M|   0     0 | 119M 3291k|  63k   68k|  1   1  26  72   0   0|3366k 3338k
13.2 14.0 10.7|1458  28.4 |1.1  13 1.5|9.97G  884k 21.2G  226M|   0     0 | 119M 3278k|  61k   68k|  2   1  28  69   0   0|3396k 3349k
12.7 13.8 10.7|1477  27.6 |0.9  11 1.1|9.97G  884k 21.2G  237M|   0     0 | 119M 3321k|  69k   72k|  2   1  31  65   0   0|3653k 3605k
12.0 13.7 10.7|1474  27.4 |1.1 8.7 0.3|9.96G  888k 21.2G  236M|   0     0 | 119M 3287k|  71k   75k|  2   1  36  61   0   0|3807k 3768k
11.8 13.6 10.7|1492  53.7 |1.6  12 1.2|9.95G  884k 21.2G  228M|   0     0 | 119M 6574k|  73k   75k|  2   2  32  65   0   0|3888k 3829k

Edit

Switched to LeveledCompactionStrategy & disabled compression on sstables, I don't see a big improvement:

There was a bit of improvement in profiles/sec updated. Its now 550-600 profiles /sec. But, the cpu spikes remain ie the iowait.

gcstats

   Interval (ms) Max GC Elapsed (ms)Total GC Elapsed (ms)Stdev GC Elapsed (ms)   GC Reclaimed (MB)         Collections      Direct Memory Bytes
          755960                  83                3449                   8         73179796264                 107                       -1

dstats

---load-avg--- --io/total- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage---- -net/total-
 1m   5m  15m | read  writ|run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq| recv  send
7.02 8.34 7.33| 220  16.6 |0.0   0 1.1|10.0G  756k 21.2G  246M|   0     0 |  13M 1862k|  11k   13k|  1   0  94   5   0   0|   0     0
6.18 8.12 7.27|2674  29.7 |1.2 1.5 1.9|10.0G  760k 21.2G  210M|   0     0 | 119M 3275k|  69k   70k|  3   2  83  12   0   0|3906k 3894k
5.89 8.00 7.24|2455   314 |0.6 5.7   0|10.0G  760k 21.2G  225M|   0     0 | 111M   39M|  68k   69k|  3   2  51  44   0   0|3555k 3528k
5.21 7.78 7.18|2864  27.2 |2.6 3.2 1.4|10.0G  756k 21.2G  266M|   0     0 | 127M 3284k|  80k   76k|  3   2  57  38   0   0|4247k 4224k
4.80 7.61 7.13|2485   288 |0.1  12 1.4|10.0G  756k 21.2G  235M|   0     0 | 113M   36M|  73k   73k|  2   2  36  59   0   0|3664k 3646k
5.00 7.55 7.12|2576  30.5 |1.0 4.6   0|10.0G  760k 21.2G  239M|   0     0 | 125M 3297k|  71k   70k|  2   1  53  43   0   0|3884k 3849k
5.64 7.64 7.15|1873   174 |0.9  13 1.6|10.0G  752k 21.2G  237M|   0     0 | 119M   21M|  62k   66k|  3   1  27  69   0   0|3107k 3081k

You could notice the cpu spikes.

My main concern is iowait before I increase the load further. Anything specific I should looking for thats causing this? Because, 600 profiles / sec (ie 600 Reads + Writes) seems low to me.

Answer 1

Can you try LeveledCompactionStrategy? With 1:1 read/writes on large objects like this the IO saved on reads will probably counter the IO spent on the more expensive compactions.

If your already compressing the data before sending it, you should turn off compression on the table. Its breaking it into 64kb chunks which will be largely dominated by only 6 values which wont get much compression (as shown in horrible compression ratio SSTable Compression Ratio: 0.9984539538554672 ).

ALTER TABLE dev.userprofile
  WITH compaction = { 'class' : 'LeveledCompactionStrategy'  }
  AND compression = { 'sstable_compression' : '' };

400 profiles/second is very very slow though and there may be some work to do on your client that could potentially be bottleneck as well. If you have a 4 load on a 8 core system its may not Cassandra slowing things down. Make sure your parallelizing your requests and using them asynchronously, sending requests sequentially is a common issue.

With larger blobs there is going to be an impact on GCs, so monitoring them and adding that information can be helpful. I would be surprised for 10kb objects to affect it that much but its something to look out for and may require more JVM tuning.

If that helps, from there I would recommend tuning the heap and upgrading to at least 3.7 or latest in 3.0 line.

Cassandra Read/Write performance - High CPU

Question

1 answers

solution1
1 2016-10-31 02:57:33

Cassandra Read/Write performance - High CPU

Question

1 answers

solution1 1 2016-10-31 02:57:33

solution1
1 2016-10-31 02:57:33