
Can Cassandra partition tables?

I'm inserting ~8 rows per second, and I would like to have one big table with all rows, but also to partition that table into many tables, one per week. E.g.:

select * from keyspace.rootTable; -> returns all rows from all tables
select * from keyspace.27-2016Table; -> returns all rows from week 27

At ~8 rows per second, with 86,400 seconds per day and 604,800 seconds per week, you'll be storing 691,200 rows per day and 4,838,400 rows each week. Even without knowing how wide your rows are, that's too many to return in a single query. Cassandra is great for storing lots of data like this. But querying lots of data like this...not so much.

You would probably want to partition by hour, but even that would give you 28,800 rows per partition. That's at least semi-manageable, so let's go with that.

I'd build a table that looks like this, partitioning on week and hourBucket while clustering on writeTime:

CREATE TABLE youAreAskingCassandraForTooManyRows (
  week text,
  hourBucket text,
  writeTime timestamp,
  value text,
  PRIMARY KEY ((week,hourBucket),writeTime))
WITH CLUSTERING ORDER BY (writeTime DESC);
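
Writes would then supply both bucket values explicitly. Here is a minimal sketch of an INSERT, assuming (as the sample data below suggests) that your application computes week as a "yyyyMM-weekOfMonth" string and hourBucket as a "yyyyMMdd-HH" string before writing:

-- week and hourBucket are computed client-side; these literals match the sample rows shown below
INSERT INTO youAreAskingCassandraForTooManyRows (week, hourBucket, writeTime, value)
VALUES ('201607-3', '20160713-14', '2016-07-13 14:01:04+0000', 'value1');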

Then I could query a specific week and hour just by supplying the partition keys:

aploetz@cqlsh:stackoverflow> SELECT * 
  FROM youareaskingcassandrafortoomanyrows 
  WHERE week='201607-3' AND hourBucket ='20160713-14';

 week     | hourBucket   | writetime                | value
----------+--------------+--------------------------+--------
 201607-3 |  20160713-14 | 2016-07-13 14:01:18+0000 | value6
 201607-3 |  20160713-14 | 2016-07-13 14:01:14+0000 | value5
 201607-3 |  20160713-14 | 2016-07-13 14:01:12+0000 | value4
 201607-3 |  20160713-14 | 2016-07-13 14:01:10+0000 | value3
 201607-3 |  20160713-14 | 2016-07-13 14:01:07+0000 | value2
 201607-3 |  20160713-14 | 2016-07-13 14:01:04+0000 | value1

(6 rows)

Or even for a specific range, based on the clustering key writetime.

aploetz@cqlsh:stackoverflow> SELECT * 
  FROM youareaskingcassandrafortoomanyrows 
  WHERE week='201607-3' AND hourBucket ='20160713-14' 
    AND writetime > '2016-07-13 14:01:05+0000' 
    AND writetime < '2016-07-13 14:01:18+0000';

 week     | hourBucket   | writetime                | value
----------+--------------+--------------------------+--------
 201607-3 |  20160713-14 | 2016-07-13 14:01:14+0000 | value5
 201607-3 |  20160713-14 | 2016-07-13 14:01:12+0000 | value4
 201607-3 |  20160713-14 | 2016-07-13 14:01:10+0000 | value3
 201607-3 |  20160713-14 | 2016-07-13 14:01:07+0000 | value2

(4 rows)

As for this part of your question:

 select * from keyspace.rootTable; -> returns all rows from all tables

It should go without saying that if querying an entire week's worth of 4-million-plus rows is so huge that it will time out, then querying your entire table is a monumentally bad idea.

It's important to note that Cassandra is not a relational database. It is a distributed system, so running unbound queries (queries without a WHERE clause) introduces LOTS of network time into your equation. That's why you always want to specify at least the partition key(s) in every SELECT query: then you can guarantee that the query will be satisfied from a single node.
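
To make that concrete, here is a sketch of the difference, using the table defined above:

-- Unbound: no WHERE clause, so every node in the cluster has to be consulted (avoid this)
SELECT * FROM youareaskingcassandrafortoomanyrows;

-- Bound by the full partition key: served by the replicas for that single partition
SELECT * FROM youareaskingcassandrafortoomanyrows
 WHERE week = '201607-3' AND hourBucket = '20160713-14';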

You should take a look at Patrick McFadin's article Getting Started with Time Series Data Modeling. That should help you understand how to partition data like this and get you on the right path.
