I'm pretty new to the field of big data and currently stucking by a fundamental decision.
For a research project i need to store millions of log entries per minute to my Cassandra based data center, which works pretty fine. (single data center, 4 nodes)
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. The log entries should be queried by different time ranges in first instance. As recommended i start to "model my query" in a big row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional detail: column keys are composition of timestamp+uuid, to be unique and to avoid overwritings; log entries of a specific time are stored nearby on a node by its identical partition key;
Thus log entries are stored in shorttime intervals per row. For example every log entry for 2015-01-01 01:05
with the precision of a minute. Queries are not really peformed as a range query with an <
operator, rather entries are selected as blocks of a specified minute.
Range based queries succeed in a decent response time which is fine for me.
Question: In the next step we want to gain additional informations by queries, which are mainly focused on the IP
field. For example: select all the entries which have IP1=xx.xx.xx.xx
and IP2=yy.yy.yy.yy
.
So obviously the current model is pretty not usable for additional IP focused CQL queries. So the problem is not to find a possible solution, rather the various choices of possible technologies which could be a possible solution:
With my lack of knowledge in this field, it is pretty hard to find the best way which i should take. Especially with the feeling that the usage of a cluster computing framework would be an excessive solution.
As I understood your question, your table schema looks like this:
create table logs (
minute timestamp,
id timeuuid,
ips list<string>,
message text,
primary key (minute,id)
);
With this simple schema, you:
From my point of view, there are multiple ways of implementing this idea:
For my opinion, the C* way of doing such things is to have a set of separate tables for different queries. It will give you an ability to perform blazing-fast queries (but with increased storage cost).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.