
Slow Insert Time With Composite Primary Key in Cassandra

I have been working with Cassandra and have hit a stumbling block. For the way I need to search the data, a composite primary key works great, but insert times for records in this Column Family become very slow with it, and I am not entirely sure why.

Table Definition:

CREATE TABLE exampletable (
clientid int,
filledday int,
filledtime bigint,
id uuid,
...etc...
PRIMARY KEY (clientid, filledday, filledtime, id)
);

clientid = the internal id of the client.
filledday = the number of days since 1/1/1900.
filledtime = the number of ticks into the day at which the record was received.
id = a GUID.

The day and time structure exists because I need to be able to filter by day easily and quickly.

I know Cassandra stores Column Families with composite primary keys quite differently. From what I understand, it stores everything as new columns off a base row keyed by the first component of the primary key. Is that the reason the inserts would be slow? When I say slow, I mean that with a primary key on just id the insert takes ~200 milliseconds, while with the composite primary key (or any subset of it; I tried just clientid and id to the same effect) it takes upwards of 32 seconds for 1000 records.

Selects are faster from the composite key table, because with the standard key table I have to apply secondary indexes and use 'ALLOW FILTERING' to get the proper records back. (I know I could do this filtering in code, but I am dealing with some massive data sets, so that will not always be practical or possible.)

Am I declaring the Column Family or the Primary Key wrong for what I am trying to do? With all the unlisted, non-primary key columns the table is 37 columns wide; would that be the problem? I am quite stumped at this point. I have not been able to find anything about others having similar problems.

Well, your partition key is the client id, so all writes per client go to one node. If you are writing lots of data per client, you could end up with a hotspot, thus decreasing your overall throughput.
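
If that hotspot turns out to be the problem, one common mitigation is to make the day part of the partition key, so that a client's writes land in one partition per day rather than one partition overall. A minimal sketch, assuming every query can supply both clientid and filledday (the payload column is a hypothetical stand-in for the unlisted columns):

```cql
-- Sketch only: (clientid, filledday) together form the partition key;
-- filledtime and id remain clustering columns.
CREATE TABLE exampletable (
    clientid int,
    filledday int,
    filledtime bigint,
    id uuid,
    payload text,  -- hypothetical stand-in for the remaining columns
    PRIMARY KEY ((clientid, filledday), filledtime, id)
);

-- Reads must then name the full partition key (41441 is an arbitrary day value):
SELECT * FROM exampletable WHERE clientid = 123 AND filledday = 41441;
```

The trade-off is that a read spanning several days needs one query per day (or an IN on filledday), so this only fits if day-scoped reads are the norm.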

Also, could you give an example of the queries that you run? In Cassandra, the data model always needs to resemble the queries you want to run. If you need 'ALLOW FILTERING', then something is probably not quite right with your data model. For instance, I don't really see the point of "filledtime" in your PK. If you want to query by time period, just replace your three clustering columns (filledday, filledtime, id) with a single TimeUUID column "ts". This would create a wide row partitioned by client id, with one column per entry, each clustered by its unique timestamp. This allows queries like:

select * from exampletable where clientid = 123 and ts > minTimeuuid('2013-06-18 16:23:00') and ts < minTimeuuid('2013-06-18 16:24:00');
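
For that query to work, the table would need to look roughly like the following. This is a sketch under assumptions: "ts" is the column name suggested above, and payload is a hypothetical stand-in for the columns not listed in the question.

```cql
-- Sketch only: clientid is the partition key, ts the single clustering column.
CREATE TABLE exampletable (
    clientid int,
    ts timeuuid,
    payload text,  -- hypothetical stand-in for the remaining columns
    PRIMARY KEY (clientid, ts)
);

-- now() generates a new timeuuid on the coordinator at insert time.
INSERT INTO exampletable (clientid, ts, payload) VALUES (123, now(), 'example');
```

Since a timeuuid is unique as well as time-ordered, it covers the roles of both the time columns and the separate id column in the original key.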

Again, this would depend on the queries you actually need to run.

And lastly, for overall guidance on data modelling, take a look at this eBay tech blog. Reading it cleared up some things for me.

Hope that helps!
