Cassandra CQL3 select row keys from table with compound primary key

Question

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.

Suppose a table created by

CREATE TABLE foo ( 
    row int, 
    column int, 
    txt text, 
    PRIMARY KEY (row, column)
);

Then I'd like to preform the equivalent of SELECT DISTINCT row FROM foo

As for my understanding it should be possible to execute this query efficiently inside Cassandra's data model(given the way compound primary keys are implemented) as it would just query the 'raw' table.

I searched the CQL documentation but I didn't find any options to do that.

My backup plan is to create a separate table - something like

CREATE TABLE foo_rows (
    row int,
    PRIMARY KEY (row)
);

But this requires the hassle of keeping the two in sync - writing to foo_rows for any write in foo(also a performance penalty).

So is there any way to query for distinct row(partition) keys?

Answer 1

I'll give you the bad way to do this first. If you insert these rows:

insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');

Doing a

'select row from foo;'

will give you the following:

 row
-----
   1
   1
   2
   2

Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:

select row from foo where column = 1;

But then you will get this warning:

Bad Request: Cannot execute this query as it might involve data filtering and thus may  have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

Ok. Then with this:

select row from foo where column = 1 ALLOW FILTERING;

 row
-----
   1
   2

Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.

You mentioned that you were worried about a performance hit on inserting into multiple tables. Multiple table inserts are a perfectly valid data modeling technique. Cassandra can do a enormous amount of writes. As for it being a pain to sync, I don't know your exact application, but I can give general tips.

If you need a distinct scan, you need to think partition columns. This is what we call a index or query table. The important thing to consider in any Cassandra data model is the application queries. If I was using IP address as the row, I might create something like this to scan all the IP addresses I have in order.

CREATE TABLE ip_addresses (
 first_quad int,
 last_quads ascii,
 PRIMARY KEY (first_quad, last_quads)
);

Now, to insert some rows in my 192.xxx address space:

insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');

To get the distinct rows in the 192 space, I do this:

SELECT * FROM ip_addresses WHERE first_quad = 192;

 first_quad | last_quads
------------+------------
        192 |  000000001
        192 |  000000002
        192 |  000001001
        192 |  000001255

To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.

Answer 2

according to the documentation , from CQL version 3.11, cassandra understands DISTINCT modifier. So you can now write

SELECT DISTINCT row FROM foo

Answer 3

@edofic

Partition row keys are used as unique index to distinguish different rows in the storage engine so by nature, row keys are always distinct. You don't need to put DISTINCT in the SELECT clause

Example

 INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
 INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
 INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');

Then

SELECT row FROM foo

will return 2 values: 1 and 2

Below is how things are persisted in Cassandra

+----------+-------------------+------------------+
| row key | column1/value | column2/value |
+----------+-------------------+------------------+
| 1 | 1/'1' | 2/'2' |
| 2 | 1/'1' | |
+----------+-------------------+------------------+

Cassandra CQL3 select row keys from table with compound primary key

Question

3 answers

solution1
7 2013-09-01 15:59:24

solution2
4 ACCPTED 2013-11-26 15:09:18

solution3
0 2013-09-01 15:44:35

Cassandra CQL3 select row keys from table with compound primary key

Question

3 answers

solution1 7 2013-09-01 15:59:24

solution2 4 ACCPTED 2013-11-26 15:09:18

solution3 0 2013-09-01 15:44:35

solution1
7 2013-09-01 15:59:24

solution2
4 ACCPTED 2013-11-26 15:09:18

solution3
0 2013-09-01 15:44:35