简体   繁体   中英

Regarding Cassandra Read Performance

I am working on a sensor data(timeseries). Number of columns in a table is 3000.
for eg: nodeid,timestamp,sen1,sen2,.....sen-n . nodeid and timestamp are primary key with clustering order by timestamp.
Number of records are 10000.

When a SELECT query for single column( SELECT timestamp,sen1 FROM <table> ) requ is requsted through cassandra datastax java driver 3.0 it is replies in 15 sec; ie If I want read all the tags, one a tag at a time for all 3000 tags requires 3000*15 sec = 12 to 13 hours aproximately. It is on single node cluster with 16GB RAM.
I allocated 10GB for JVM. Still response time is not changed. I used LevelCompactionStragy at the time of table creation.

Hardware: Intel Core i7 and Normal Hard disk not SSD,8GB RAM How to reduce that read or query time on the single node cluster?

Obviously, there is problem with data modelling. IMO, a table with 3000 columns is bad. if your use case is like "SELECT timestamp,sen1 FROM ", then you should model it as " Primary Key(Timestamp, SensorId) ".

"SELECT timestamp,sen1" in your model, cassandra will still read all other column values from disk into memory.

I am not sure what is 'nodeId' in your case.. I hope it's not cassandra node id..

(SELECT timestamp,sen1 FROM table)

This is like getting all the data at once(in your case 10000 records). So getting 1 column or 3000 columns will make Cassandra server to read through all the SSTables. The point is it won't be 12 or 13 hours.

Still 15 seconds seems unbelievable. Did you also include the network latency and client side write in this measure?

As mentioned in one of the answers your model seems to be bad (If you put timestamp as partion key, the data becomes two sparse and getting a range of data will need to read from more than one partition. If you use only node_id as partition key, the partition will host too much data and can cross the C* limitation of 2 Billion). My advise is

  • Redesign your partition key. Please check this tutorial for a start. https://academy.datastax.com/resources/getting-started-time-series-data-modeling
  • Add more no. of nodes and increase replication factor to see better read latencies.
  • Try to design your read query such that it reads from only one partition at once. eg: SELECT * from Table where sensor_node_id = abc and year = 2016 and month = June

Hope this helps!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM