
HBase Design Row Key

I have approximately 20 billion events. An event consists of one key (SSN), one date, and information about the event. I have 5 types of events.

Read pattern: I need to get all events for a single key that are earlier than a specific date.

Write pattern: Just a single bulk load once a day.

Imagine the database:

SSN;date(yyyymmdd);info
1;20140101;A
1;20140105;B
2;20140106;A
1;20140103;C

So if my query is (SSN = "1" and date < "20140104"), I need to get:

1;20140101;A
1;20140103;C

My first approach is:

  • Row Key = SSN + date.
  • One family with many columns to store information. (info:cep, info:name, ...)

Does anyone see a performance problem with this approach? Although my keys are composed using a date, I don't think this causes "monotonically increasing values", because the SSN comes first.

This is a perfectly fine design. For read scans you would use startKey = ssn + 0 and endKey = ssn + date. You would need to allocate a fixed number of characters for the user identifier field (9 for an SSN), because row keys are sorted lexicographically. 20 billion events / 420 million total SSNs in circulation ≈ 47 events per SSN, assuming an even distribution. That's not much, but I would think about index size and any optimizations required.
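As a rough illustration of why the fixed width matters, here is a minimal Python sketch that simulates the scan on the sample rows from the question. It does not use the HBase client: a sorted list stands in for the table, and `row_key`, `scan_range`, and `scan` are hypothetical helper names. It assumes SSNs are zero-padded to 9 characters and that the stop key is exclusive, as it is by default for an HBase scan.

```python
from bisect import bisect_left

SSN_WIDTH = 9  # assumption: every SSN is zero-padded to 9 characters


def row_key(ssn, date):
    """Build a row key: fixed-width SSN followed by a yyyymmdd date."""
    return ssn.zfill(SSN_WIDTH) + date


def scan_range(ssn, before_date):
    """Return (start_key, stop_key) covering one SSN's events before a date.

    The start key is the bare SSN prefix, which sorts before every key
    that has a date appended; the stop key is treated as exclusive.
    """
    prefix = ssn.zfill(SSN_WIDTH)
    return prefix, prefix + before_date


def scan(rows, start, stop):
    """Simulate a range scan over rows sorted by key (start inclusive, stop exclusive)."""
    keys = [key for key, _ in rows]
    return rows[bisect_left(keys, start):bisect_left(keys, stop)]


# Sample events from the question, sorted lexicographically by row key.
rows = sorted([
    (row_key("1", "20140101"), "A"),
    (row_key("1", "20140105"), "B"),
    (row_key("2", "20140106"), "A"),
    (row_key("1", "20140103"), "C"),
])

start, stop = scan_range("1", "20140104")
result = scan(rows, start, stop)
for key, info in result:
    print(key, info)
```

Without the padding, a key for SSN "12" would sort between the keys of SSN "1", and a prefix-based range could return events belonging to another SSN.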

Events are time-series. You might be interested in the following summary. It has 3 use cases: https://cloud.google.com/bigtable/pdf/CloudBigtableTimeSeries.pdf

The design is good given the required read/write pattern. OpenTSDB actually uses a similar schema to store timeseries data (for heavy realtime metrics, sensor data, etc...)

You do have monotonically increasing keys (for a given SSN), but it does not matter for two reasons:

  • The leading part of the key is an SSN, and its cardinality (~450 million) is far greater than the number of nodes/region servers.
  • Most importantly, you write only once a day! Monotonically increasing keys might cause hotspotting depending on your data distribution and write patterns. Doing a bulk load means creating a pre-split table, generating HFiles 'offline', and loading them all at once without going through the HBase write pipeline (WAL, MemStore, minor/major compactions). Write-time hotspotting cannot happen.
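To make the pre-split idea concrete, here is a small Python sketch that computes evenly spaced split keys over the SSN prefix. It is only an illustration under stated assumptions: keys start with a 9-character zero-padded SSN, SSNs are spread roughly uniformly over the numeric range (real SSNs may not be), and `split_points` is a hypothetical helper, not part of HBase.

```python
SSN_WIDTH = 9  # assumption: row keys start with a 9-character zero-padded SSN


def split_points(num_regions, width=SSN_WIDTH):
    """Return num_regions - 1 evenly spaced key prefixes over the SSN space.

    Each returned string is the first key of a region; the first region
    implicitly starts at the empty key.
    """
    if num_regions < 2:
        return []
    space = 10 ** width
    return [str(space * i // num_regions).zfill(width)
            for i in range(1, num_regions)]


points = split_points(4)
print(points)
```

The resulting prefixes could then be supplied as split keys when the table is created. With skewed data, sampling the actual keys to choose boundaries would give more balanced regions than an even numeric split.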

A minor piece of advice:

  • Use single character column family names: info -> d (as in 'data')

