Composite columns and “IN” relation in Cassandra

Question

I have the following column family in Cassandra for storing time series data in a small number of very "wide" rows:

CREATE TABLE data_bucket (
  day_of_year int,
  minute_of_day int,
  event_id int,
  data ascii,
  PRIMARY KEY (day_of_year, minute_of_day, event_id)
)

On the CQL shell, I am able to run a query such as this:

select * from data_bucket where day_of_year = 266 and minute_of_day = 244 
  and event_id in (4, 7, 11, 1990, 3433)

Essentially, I fix the value of the first component of the composite column name (minute_of_day) and want to select a non-contiguous set of columns based on the distinct values of the second component (event_id). Since the "IN" relation is interpreted as an equality relation, this works fine.

My question is: how would I accomplish the same type of composite column slicing programmatically, without CQL? So far I have tried the Python client pycassa and the Java client Astyanax, without success.

Any thoughts would be welcome.

EDIT:

I'm adding the describe output of the column family as seen through cassandra-cli. Since I am looking for a Thrift-based solution, maybe this will help.

ColumnFamily: data_bucket
  Key Validation Class: org.apache.cassandra.db.marshal.Int32Type
  Default column value validator: org.apache.cassandra.db.marshal.AsciiType
  Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.Int32Type,org.apache.cassandra.db.marshal.Int32Type)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  DC Local Read repair chance: 0.0
  Populate IO Cache on flush: false
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  Compression Options:
    sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Answer 1

There is no "IN"-type query in the Thrift API. You could instead perform a series of get queries, one for each composite column value (day_of_year, minute_of_day, event_id).
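
The per-column approach can be sketched as follows. This is a minimal illustration, not tested against a live cluster: it assumes a client object whose get(key, columns=[...]) returns a dict of column name to value and raises an error when nothing matches, which is roughly how I understand pycassa's ColumnFamily.get to behave. FakeColumnFamily, NotFound, and get_events are hypothetical names invented here so the example runs without a Cassandra connection.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the client's 'no such column' error."""


class FakeColumnFamily(object):
    """Hypothetical in-memory stand-in for a Thrift column family.

    rows maps a row key (day_of_year) to a dict of
    {(minute_of_day, event_id): data}.
    """

    def __init__(self, rows):
        self.rows = rows

    def get(self, key, columns):
        row = self.rows.get(key, {})
        found = dict((name, row[name]) for name in columns if name in row)
        if not found:
            raise NotFound(key)
        return found


def get_events(cf, day_of_year, minute_of_day, event_ids, not_found=NotFound):
    """Emulate IN with one get per (minute_of_day, event_id) column name."""
    results = {}
    for event_id in event_ids:
        name = (minute_of_day, event_id)
        try:
            results.update(cf.get(day_of_year, columns=[name]))
        except not_found:
            pass  # this event_id is simply absent from the row
    return results


cf = FakeColumnFamily({
    266: {(244, 4): 'a', (244, 7): 'b', (244, 8): 'c', (245, 4): 'd'},
})
hits = get_events(cf, 266, 244, [4, 7, 11])
```

With a real client you would pass its column family object and its own not-found exception class instead of the stand-ins. The cost is one round trip per event_id.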

If your event_ids were sequential (and your question says they are not), you could perform a single get_slice query, passing in a range (e.g., day_of_year, minute_of_day, and a range of event_ids). You could grab bunches of them this way and filter the response programmatically yourself (e.g., grab all data on the date with event IDs between 4 and 3433). That means more data transfer and more processing on the client side, so it is not a great option unless you really are looking for a range.
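
To make the trade-off concrete, here is a small pure-Python sketch of the "slice a range, then filter on the client" idea. It does not call Cassandra at all: the sorted list stands in for one wide row whose columns are ordered by (minute_of_day, event_id), and bisect imitates what a range slice between two composite names would return. The function name slice_and_filter and the sample data are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def slice_and_filter(sorted_columns, minute_of_day, event_ids):
    """Return (fetched, kept) for one row.

    sorted_columns: list of ((minute_of_day, event_id), data) pairs,
    sorted by column name, as a composite comparator would store them.
    """
    wanted = set(event_ids)
    if not wanted:
        return [], []
    start = (minute_of_day, min(wanted))
    finish = (minute_of_day, max(wanted))
    names = [name for name, _ in sorted_columns]
    # Inclusive range [start, finish], like a column slice would cover.
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    fetched = sorted_columns[lo:hi]
    # Client-side filter: drop columns whose event_id was not requested.
    kept = [(name, data) for name, data in fetched if name[1] in wanted]
    return fetched, kept


row = [
    ((244, 4), 'a'), ((244, 5), 'x'), ((244, 7), 'b'), ((244, 11), 'c'),
    ((244, 2000), 'y'), ((244, 3433), 'd'), ((245, 4), 'z'),
]
fetched, kept = slice_and_filter(row, 244, [4, 7, 11, 1990, 3433])
```

In this toy row the slice transfers six columns but only four survive the filter. On a real wide row the gap between the two numbers can be much larger, which is the overhead described above.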

So, if you want to use "IN" with Cassandra, you will need to switch to a CQL-based solution. If you are considering using CQL from Python, another option is cassandra-dbapi2. This worked for me:

import cql

# Replace settings as appropriate
host = 'localhost'
port = 9160
keyspace = 'keyspace_name'

# Connect
connection = cql.connect(host, port, keyspace, cql_version='3.0.1')
cursor = connection.cursor()
print "connected!"

# Execute CQL
cursor.execute("select * from data_bucket where day_of_year = 266 and minute_of_day = 244 and event_id in (4, 7, 11, 1990, 3433)")
for row in cursor:
  print str(row) # Do something with your data

# Shut the connection
cursor.close()
connection.close()

(Tested with Cassandra 2.0.1.)

solution1
1 ACCEPTED 2013-09-24 22:15:17
