
HBase row key filters, range scans, and Cassandra's capabilities for this

In HBase, I am loading data using row keys of the form 'app_name_ip_timestamp'. There will be many such applications, so in essence I collect around 50k data points to insert every minute.

If I have to query based on ip, I can use a substring filter on the row key, but is this a good method? Can Cassandra help with this in any way? What are the advantages of Cassandra in this scenario? How can I make HBase fit a situation where I can use a row key substring filter, perform a range scan, and retrieve results in milliseconds? What is the major difference between querying Cassandra and HBase in terms of ad hoc queries, partial row keys, range scans, and aggregated results?

I cannot speak for Cassandra, so I'll answer your questions with HBase in mind, since this type of question has been asked here multiple times. You basically need a secondary index, which HBase does not support directly; please read the following documentation about it: http://hbase.apache.org/book/secondary.indexes.html

Given your access pattern, I'd recommend manually dual-writing to both the data table and a table acting as a secondary index, using 2 different types of rowkeys:

[ip_as_long]-1-[timestamp]-[appname]
[ip_as_long]-2-[appname]-[timestamp]

This index table will have only one family with one column, holding the rowkey of the data point in the data table. With a good write buffer you shouldn't notice much of a performance hit.
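The two index rowkey layouts and the dual write can be sketched in plain Python. This is only an illustration, not HBase client code: the tables are modeled as dictionaries, and the function names (`ip_to_long`, `index_rowkeys`, `dual_write`) are hypothetical. The zero-padding of the IP and timestamp is an assumption, added so that string keys sort in numeric order.

```python
import ipaddress


def ip_to_long(ip: str) -> int:
    """Convert a dotted IPv4 address to its integer value."""
    return int(ipaddress.IPv4Address(ip))


def index_rowkeys(app_name: str, ip: str, timestamp: int) -> tuple:
    """Build both index rowkeys for one data point.

    Assumption: fixed-width, zero-padded numbers (10 digits for an IPv4
    address, 13 for a millisecond timestamp) so that lexicographic key
    order matches numeric order.
    """
    ip_part = f"{ip_to_long(ip):010d}"
    ts_part = f"{timestamp:013d}"
    by_time = f"{ip_part}-1-{ts_part}-{app_name}"  # [ip_as_long]-1-[timestamp]-[appname]
    by_app = f"{ip_part}-2-{app_name}-{ts_part}"   # [ip_as_long]-2-[appname]-[timestamp]
    return by_time, by_app


def dual_write(data_table: dict, index_table: dict,
               app_name: str, ip: str, timestamp: int, value) -> str:
    """Write the data point, then one index entry per rowkey layout.

    The dicts stand in for the data table and the index table; each
    index entry stores only the rowkey of the data point.
    """
    data_key = f"{app_name}_{ip}_{timestamp}"
    data_table[data_key] = value
    for index_key in index_rowkeys(app_name, ip, timestamp):
        index_table[index_key] = data_key
    return data_key
```

In a real deployment the two dictionary writes would be puts against the two tables, batched by whatever buffering the client provides.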

To query data based on the ip, just scan the index table, setting the start row to "[ip_as_long]-1-" to query by timestamp, or to "[ip_as_long]-2-[appname]" to query by app name, and end the scan once the rowkeys no longer share that prefix (for example with a stop row or a prefix filter). The scan gives you the rowkeys, which you can then use in a multiget against the data table to retrieve the data.

With this approach in mind, you can add another secondary index table with the appname as the main dimension, so you can also query the data by "[appname]-[timestamp]".

Recommendation: if you have enough storage, instead of writing only the rowkey to the index, I'd write the whole data point itself; that way you avoid having to perform a multiget.
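The scan-then-multiget lookup can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the HBase API: the sorted key list stands in for the index table's rowkey ordering, the dictionaries stand in for the tables, and `prefix_scan` and `query_by_prefix` are hypothetical names. It assumes zero-padded keys like "0167772161-1-1700000000000-app1".

```python
from bisect import bisect_left


def prefix_scan(sorted_keys: list, prefix: str) -> list:
    """Return all keys starting with `prefix`, like a start/stop-row scan.

    The start row is the prefix itself. The exclusive stop row is the
    prefix with its last character incremented, which is the first
    possible key that no longer shares the prefix.
    """
    start = bisect_left(sorted_keys, prefix)
    stop_row = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    stop = bisect_left(sorted_keys, stop_row)
    return sorted_keys[start:stop]


def query_by_prefix(index_table: dict, data_table: dict, prefix: str) -> list:
    """Scan the index for `prefix`, then fetch the referenced data rows."""
    index_keys = prefix_scan(sorted(index_table), prefix)
    data_keys = [index_table[k] for k in index_keys]
    # Stand-in for the multiget against the data table.
    return [data_table[k] for k in data_keys]
```

When querying by app name, ending the prefix with the "-" delimiter (for example "0167772161-2-app1-") keeps an app called "app10" from matching a query for "app1".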
