
HBase schema/row key design for browse data?

We are planning to use HBase in one of our projects.

We receive browsing logs from our internal systems; a sample of the data format is included at the end of this question.

Our requirement is to support three types of search:

  1. Destination IP (D IP) + date range (start date and end date)
  2. Source IP (S IP) + date range (start date and end date)
  3. URL + date range (start date and end date)

I am thinking of creating three HBase tables:

  1. Row key as DestinationIP + DateTime
  2. Row key as SourceIP + DateTime
  3. Row key as URL + DateTime
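
To make the "IP + DateTime" idea concrete, here is a minimal sketch of how such a row key and the matching scan boundaries could be built. It is only an illustration under several assumptions that are not in the question: the IP is packed into 4 bytes, the DateTime is treated as UTC and stored as a 4-byte big-endian epoch, and the helper names (`row_key`, `scan_range`) are hypothetical. It also ignores the fact that two requests from the same IP in the same second would map to the same key, so a real design would need a tiebreaker.

```python
import socket
import struct
from datetime import datetime, timezone


def ip_to_bytes(ip: str) -> bytes:
    """Pack a dotted-quad IPv4 address into 4 big-endian bytes."""
    return socket.inet_aton(ip)


def ts_to_epoch(stamp: str) -> int:
    """Parse a yyyyMMddHHmmss string (assumed to be UTC) into epoch seconds."""
    dt = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def row_key(ip: str, stamp: str) -> bytes:
    """Hypothetical row key: 4-byte IP followed by 4-byte epoch seconds."""
    return ip_to_bytes(ip) + struct.pack(">I", ts_to_epoch(stamp))


def scan_range(ip: str, start_stamp: str, end_stamp: str):
    """Return (start_row, stop_row) covering [start_stamp, end_stamp] for one IP.

    HBase scans treat the stop row as exclusive, so one second is added
    to the end of the range to keep end_stamp itself inside the scan.
    """
    prefix = ip_to_bytes(ip)
    start = prefix + struct.pack(">I", ts_to_epoch(start_stamp))
    stop = prefix + struct.pack(">I", ts_to_epoch(end_stamp) + 1)
    return start, stop
```

Because HBase orders rows by unsigned byte comparison, fixed-width big-endian fields keep all rows for one IP contiguous and sorted by time, so each of the three searches becomes a single range scan on its table.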

If I go with the above approach, it will cost us a lot of space to store this data.

S IP            DateTime       Method URL        - ResponseCode - D IP -
176.204.134.111 20140421093842 GET    http://googleads.g.doubleclick.net/pagead/adview?ai=CAbmt4K5UU47XB5GS8wPOi4C4CKH1-ZwCkbiU7inAjbcBEAEgptSKH1D0-ev7B2CRdsgBAakC4V3k_lZFkj6oAwHIA4oEqgSQAU_QtfygurroekV-h5dYCoVP70qKDV1sAkiI60NNZiQ1wICQkqb5XMC3TllLKrhD0KxX0kb9-LnGkCDTqGmDE3Do-UdLGIyluqQ7MwoAcuTJMUajYKOflKPd2ZDj6RlKUAI9pbdkb96-k-XTVpON9rjUM2vUkvjwW3BwSfQk656GjoyUcEwsjwWId7p7obHcTsAEqf_DzQKSBQQIBBgBkgUECAUYBJAGAdgGAoAHueeCC5gHAQ&sigh=7zrG0DRVvMA 0 TCP_MISS/200 - 173.194.66.155 -  0
2.50.165.129    20140421093842 GET    http://www.alquds.co.uk/wp-content/uploads/2014/04/1217.jpg 0 TCP_MISS/200 - 46.165.251.78 -  0

What would be a good schema design for these requirements?

Consider using OpenTSDB, which is optimized for the storage of small key-value time series data.

Even if you don't choose to use it, definitely read this slide deck discussing the schema design decisions that went into it.
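
One idea from that design that may carry over is time bucketing: as far as I understand it, OpenTSDB keeps one row per series per hour and encodes the offset within the hour in the column qualifier, which shortens row keys and reduces the number of rows. The sketch below is loosely modeled on that idea and is not OpenTSDB's actual encoding; the one-hour bucket width, the 2-byte offset qualifier, and the function name `bucketed_cell` are assumptions made for illustration.

```python
import socket
import struct

BUCKET_SECONDS = 3600  # assumed bucket width: one row per IP per hour


def bucketed_cell(ip: str, epoch_seconds: int):
    """Return (row_key, column_qualifier) for one log event.

    The row key holds the IP and the start of the hour; the qualifier
    holds the number of seconds elapsed since that hour started.
    """
    base = epoch_seconds - (epoch_seconds % BUCKET_SECONDS)
    offset = epoch_seconds - base
    row = socket.inet_aton(ip) + struct.pack(">I", base)
    qualifier = struct.pack(">H", offset)
    return row, qualifier
```

With this layout, all events for one IP within the same hour land in a single row, and a date-range search scans rows from the first hour bucket to the last, filtering the partial hours at either end by qualifier.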
