Cassandra java query performance count(*) or all().size()

Question

I want to know which is faster when using Apache Cassandra with Java. I have the following two options to get my result:

Statement s = QueryBuilder.select().from("table").where(QueryBuilder.eq("source", source));
ResultSet resultSet = session.execute(s);
if (resultSet.all().size() == 0) {
  //Do Something
}

The second option to achieve my count is:

ResultSet rs = session.execute("SELECT COUNT(*) as coun FROM table WHERE source = '"+source+"'");
Row r = rs.one();
if (r.getLong("coun") == 0) { // column name must match the "coun" alias used in the query
  //Do Something
}

In every query, the maximum count is 1. Now my question is: which would be faster in general?

Answer 1

I tested several queries on multiple tables; the version with count(*) is much faster than using resultSet.all().size() == 0. I used cqlsh to compare the following queries, which should be equivalent to the Java ones:

SELECT COUNT(*) as coun FROM table WHERE source = '...';

And the slower one:

SELECT * FROM table WHERE source = '...';

Answer 2

You have to think about both queries in terms of network traffic. This applies not only to Cassandra but to any request over the network (e.g. a JDBC request or a REST request).

SELECT * FROM table WHERE source = '...';

When you execute this query and then call ResultSet#all, you retrieve all (*) the partitions (those for which the where clause holds, obviously) into the memory of the process that uses the DataStax driver, which instantiates an ArrayList with all the Rows, only to finally call a simple List#size. Remember that latency is evil.

(*) Note that the all method can also spawn multiple requests over the network when the number of Rows retrieved by the query is greater than the fetch size. This is more latency!
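To put rough numbers on the pagination point, here is a minimal sketch in plain Java (no driver calls). It assumes every page is filled up to the fetch size, so the number of requests is a ceiling division. The fetch size of 5000 is, as far as I know, the DataStax Java driver default (check your version), and the 2 ms RTT is an invented figure. PagingEstimate and roundTrips are hypothetical names used only for illustration.

```java
// Rough model of how many request/response round trips ResultSet#all() needs.
final class PagingEstimate {

    // Pages needed to transfer rowCount rows when a page holds at most fetchSize rows.
    // An empty result still costs one round trip.
    static long roundTrips(long rowCount, int fetchSize) {
        if (rowCount < 0 || fetchSize <= 0) {
            throw new IllegalArgumentException("rowCount must be >= 0 and fetchSize > 0");
        }
        if (rowCount == 0) {
            return 1;
        }
        return (rowCount + fetchSize - 1) / fetchSize; // ceiling division
    }

    public static void main(String[] args) {
        int fetchSize = 5000;  // assumption: driver default fetch size
        long rttMillis = 2;    // assumption: made-up round-trip time

        long trips = roundTrips(12_001, fetchSize); // 3 pages: 5000 + 5000 + 2001
        System.out.println("all(): " + trips + " round trips, at least "
                + trips * rttMillis + " ms of pure network latency");
    }
}
```

In this model a COUNT(*) response is a single row, so it always costs exactly one round trip. With at most one matching row, as in the question, both variants need one round trip and the difference comes down to payload size.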

SELECT COUNT(*) as coun FROM table WHERE source = '...';

With this query you pay latency too, but only the inevitable part: the RTT to send the query to the Cassandra cluster and receive the response. Since the response is a single integer, it won't spawn multiple requests due to pagination, and it will consume little bandwidth.

Furthermore, IMHO it is the better choice to use select count (if you don't need the row data at all), because you are being explicit about what you need, and this gives the server (database, web server, etc.) the opportunity to process the request in a specific way and improve performance. For example, if your query had no where clause and you only needed the total number of rows, then with a select count(*) from ... the server could take advantage of an internal per-table counter and serve the query faster. However, this is not the case in Cassandra (because in the Cassandra model it would be impossible to maintain the consistency of such a counter), but I think it is clear what I mean.

Answer 3

Just call System.currentTimeMillis() before and after each option and print the difference. If millisecond accuracy is not enough, try System.nanoTime():

long start = System.currentTimeMillis();
// <YourMethod>: run the option you want to time here
long end = System.currentTimeMillis();
long dif = end - start;
System.out.println(dif + " ms");
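A single measurement like this is noisy (JIT warm-up, connection setup, caches), so a slightly more careful sketch would run each option several times and average. The version below is only an illustration: QueryTimer and averageNanos are hypothetical names, and the sleeping placeholder task stands in for whichever query variant you want to measure.

```java
// Minimal timing helper: untimed warm-up runs, then an average over timed runs.
final class QueryTimer {

    // Runs task `warmup` times without timing, then `runs` times with timing,
    // and returns the average duration of the timed runs in nanoseconds.
    static long averageNanos(Runnable task, int warmup, int runs) {
        if (warmup < 0 || runs <= 0) {
            throw new IllegalArgumentException("warmup must be >= 0 and runs > 0");
        }
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return total / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace the body with one of the two query options.
        Runnable placeholder = () -> {
            try {
                Thread.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long avg = averageNanos(placeholder, 5, 20);
        System.out.println("average: " + (avg / 1_000) + " us");
    }
}
```

Call averageNanos once per option with the same warm-up and run counts, and compare the two averages rather than two single readings.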

3 answers

solution1
2 ACCEPTED 2015-09-17 17:48:50

solution2
1 2017-10-30 13:51:20

solution3
0 2015-09-15 13:31:53
