Cassandra wide row column slice performance

Question

I am testing CQL on Cassandra 1.2 with the python-cql library on a VM with 2 GB of RAM. I have a table with a compound primary key (a wide row). When running queries against a single node I am getting about 10x worse performance than MySQL. Requests are serial with no concurrency, but I am interested in the speed of a single request.

Most importantly, is there anything I can do to optimize querying wide rows (specifically this query)?
Are these numbers reflective of Cassandra vs MySQL performance in a single-request situation?
Could my limited RAM/VM be making this big of a difference?
Would multi-node Cassandra / partitioned MySQL be closer than 10x?
Am I doing something horribly wrong?

Test code:

"""
CREATE TABLE foo_bars (
     foo_id text,
     bar_id bigint,
     content text,
     PRIMARY KEY (foo_id, bar_id)
)
WITH CLUSTERING ORDER BY (bar_id DESC);
"""

# content is up to 64k of text, and the number of bar columns in a foo row will keep growing but will probably never exceed 2 million


t1 = time.time()
for i in range(1, 1000):
    sql_query = "SELECT * FROM foo_bars WHERE foo_id IN(%s) ORDER BY id DESC LIMIT 40" % random_foo_ids
    result = db_cursor.execute(sql_query)
t2 = time.time()
print "Sql time = %s" % str(t2 - t1)


t1 = time.time()
for i in range(1, 1000):
    cql_query = "SELECT * FROM foo_bars WHERE foo_id IN(%s) LIMIT 40" % random_foo_ids
    result = cassandra_cursor.execute(cql_query)
t2 = time.time()
print "Cql time = %s" % str(t2 - t1)

Sql time = 4.2
Cql time = 58.7

Thanks In Advance!

Answer 1

You might get it a bit faster by enabling the row cache. Set row_cache_size_in_mb in cassandra.yaml to something larger than the size of your column family (CF), so 100 would work. Then set caching = 'all' for your column family. As you read, you should see the hit rate reported by nodetool info increase.
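As a rough sketch of the two settings above, written in the same style as the question's test code: the statement below assumes the Cassandra 1.2 CQL3 syntax for the caching table option, and enable_row_cache is a hypothetical helper name, not part of python-cql. It only needs an object with an execute() method, such as the question's cassandra_cursor.

```python
# Assumed line in cassandra.yaml (value in MB, taken from the answer above):
#   row_cache_size_in_mb: 100

# Assumption: Cassandra 1.2 CQL3 syntax for the per-table caching option.
ENABLE_ROW_CACHE_CQL = "ALTER TABLE foo_bars WITH caching = 'all'"


def enable_row_cache(cursor):
    """Run the ALTER TABLE statement on an already-open CQL cursor.

    `cursor` is any object with an execute(query) method, for example the
    cassandra_cursor used in the question. Returns the statement that ran.
    """
    cursor.execute(ENABLE_ROW_CACHE_CQL)
    return ENABLE_ROW_CACHE_CQL
```

After running it once, repeating the read loop from the question and checking nodetool info should show whether the row cache is being hit.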

However, I doubt you will get anything like 10x speed up.

The problem is that Cassandra, and its read path in particular, is built for high throughput rather than low latency. There are lots of queues inside Cassandra that add to latency. Adding more nodes will increase latency further (although growing the cluster much beyond the replication factor shouldn't increase it any more), but it gives an approximately linear improvement in throughput.

The solution is to use concurrency: either queues, threads and multiple connections in your single client, or multiple clients. But if that's not possible for your use case, I expect MySQL will be faster for this kind of read. Indeed, if you are only expecting to have 31 MB of data, MySQL is probably better for your use case anyway.
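To illustrate the threads option, here is a minimal sketch that compares a serial loop with a thread pool. It does not talk to Cassandra: fake_query is a hypothetical stand-in that just sleeps for a fixed latency, and the 10 ms figure and worker count are made-up values. It uses concurrent.futures, which is in the Python 3 standard library (the question's code is Python 2, where it is available as the futures backport). With a real driver you would most likely want one connection or cursor per worker thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(query_id, latency=0.01):
    """Hypothetical stand-in for one blocking request (sleeps, then returns)."""
    time.sleep(latency)
    return query_id


def run_serial(n):
    """Issue n requests one after another, as in the question's loop."""
    return [fake_query(i) for i in range(n)]


def run_concurrent(n, workers=8):
    """Issue n requests through a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(fake_query, range(n)))


if __name__ == "__main__":
    for label, runner in (("serial", run_serial), ("concurrent", run_concurrent)):
        start = time.time()
        runner(200)
        print("%s time = %.2f" % (label, time.time() - start))
```

Per-request latency is unchanged in the concurrent version; only the total time for the batch drops, which is the throughput-versus-latency trade-off described above.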

Answer 1: accepted, 2013-07-01 15:25:35