
Cassandra query flexibility

I'm pretty new to the field of big data and currently stuck on a fundamental decision.

For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well. (single data center, 4 nodes)

Log Entry
------------------------------------------------------------------
| Timestamp              | IP1         | IP2           ... 
------------------------------------------------------------------
| 2015-01-01 01:05:01    | 10.10.10.1  | 192.10.10.1   ...
------------------------------------------------------------------

Each log entry has a specific timestamp. First and foremost, the log entries should be queryable by different time ranges. As recommended, I started to "model my query" using a wide-row approach.

Basic C* Schema
------------------------------------------------------------------
| row key              | column key a         | column key b     ... 
------------------------------------------------------------------
|  2015-01-01 01:05    | 2015-01-01 01:05:01  | 2015-01-01 01:05:23
------------------------------------------------------------------

Additional detail: column keys are a composite of timestamp+uuid, so that they are unique and nothing gets overwritten; log entries for a specific time are stored close together on a node because they share the same partition key.

Thus log entries are stored in short time intervals per row, for example every log entry for 2015-01-01 01:05, with a precision of one minute. Queries are not really performed as range queries with a < operator; rather, entries are selected as blocks for a specified minute.

Range-based queries complete in a decent response time, which is fine for me.

Question: In the next step we want to gain additional information from queries that focus mainly on the IP fields. For example: select all the entries which have IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy .

Obviously the current model is not really usable for additional IP-focused CQL queries. The problem is not finding a possible solution, but rather choosing among the various technologies that could provide one:

  1. Try to solve the problem with standalone C* solutions. (Build a second model and administer the same data in a different shape)
  2. Choose additional technologies like Spark...
  3. Switch to HDFS/Hadoop - Cassandra/Hadoop solution...
  4. and so on

With my lack of knowledge in this field, it is pretty hard to decide which way I should go, especially as I have the feeling that using a cluster computing framework would be overkill.

As I understood your question, your table schema looks like this:

create table logs (
  minute timestamp,
  id timeuuid,
  ips list<text>,
  message text,
  primary key (minute,id)
);

With this simple schema, you:

  • can fetch all logs for a specific minute.
  • can fetch short inter-minute ranges of log events.
  • want to query the dataset by IP.
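
Fetching a short inter-minute range with this schema comes down to listing the minute-precision partition keys that cover the range and reading each partition in turn. A minimal sketch in Python, assuming the `logs` table above; `minute_buckets` and `SELECT_MINUTE` are hypothetical names used only for illustration, and the snippet prints the queries instead of sending them to a cluster:

```python
from datetime import datetime, timedelta


def minute_buckets(start, end):
    """Return the minute-precision partition keys covering [start, end]."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Truncate to the minute, matching the precision of the partition key.
    current = start.replace(second=0, microsecond=0)
    buckets = []
    while current <= end:
        buckets.append(current)
        current += timedelta(minutes=1)
    return buckets


# Hypothetical query text: one execution per bucket against the `logs` table.
SELECT_MINUTE = "SELECT * FROM logs WHERE minute = %s"

buckets = minute_buckets(datetime(2015, 1, 1, 1, 5, 1),
                         datetime(2015, 1, 1, 1, 7, 30))
for bucket in buckets:
    print(SELECT_MINUTE, "<-", bucket.isoformat())
```

Each bucket maps to exactly one partition, so the number of reads grows linearly with the length of the requested range.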

From my point of view, there are multiple ways of implementing this idea:

  • create a secondary index on the IP addresses. But in C* you will lose the ability to query by timestamp: unlike MySQL/PostgreSQL, C* cannot merge primary and secondary indexes.
  • denormalize the data. Write your log events to two tables at once, the first optimized for timestamp queries (minute+ts as PK), the second for IP-based queries (IP+ts as PK).
  • use Spark for analytical queries. But Spark will need to perform a (full?) table scan (in a nifty distributed map-reduce way, but it is still a table scan) each time to extract the data you request, so every query will take a long time to finish. This can cause problems if you plan to run a lot of low-latency queries.
  • use an external index like Elasticsearch for querying, and C* for storing the data.

In my opinion, the C* way of doing such things is to have a set of separate tables for different queries. It will give you the ability to perform blazing-fast queries (at the cost of increased storage).
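
The separate-tables approach could look roughly like the sketch below. Everything new here is an assumption for illustration: the `logs_by_ip` table, its columns, and the `dual_write_statements` helper are hypothetical names, and the code only builds the statements and parameters (with `%s` placeholders) rather than talking to a cluster.

```python
import uuid
from datetime import datetime

# Assumed second table for IP-focused lookups (illustrative names and types).
CREATE_LOGS_BY_IP = """
create table logs_by_ip (
  ip1 text,
  ip2 text,
  id timeuuid,
  message text,
  primary key ((ip1, ip2), id)
);
"""

INSERT_BY_MINUTE = (
    "INSERT INTO logs (minute, id, ips, message) VALUES (%s, %s, %s, %s)"
)
INSERT_BY_IP = (
    "INSERT INTO logs_by_ip (ip1, ip2, id, message) VALUES (%s, %s, %s, %s)"
)


def dual_write_statements(ts, ip1, ip2, message):
    """Build one insert per table for a single log entry."""
    # Version 1 UUID taken from the current clock, not from `ts`; a real
    # writer would probably derive the timeuuid from the event time instead.
    entry_id = uuid.uuid1()
    minute = ts.replace(second=0, microsecond=0)
    return [
        (INSERT_BY_MINUTE, (minute, entry_id, [ip1, ip2], message)),
        (INSERT_BY_IP, (ip1, ip2, entry_id, message)),
    ]


for cql, params in dual_write_statements(
        datetime(2015, 1, 1, 1, 5, 1), "10.10.10.1", "192.10.10.1", "example"):
    print(cql, params)
```

With this layout, a query for IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy reads a single partition. One caveat of the sketch: a partition keyed only by the IP pair grows without bound for busy pairs, so adding a time bucket (for example a day) to the partition key may be worth considering.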

