HBase Design Row Key

Question

I have about 20 billion events. An event consists of one key (SSN), one date, and information about the event. There are 5 types of events.

Read pattern: I need to get all events for a single key that are earlier than a specific date.

Write pattern: Just a single bulk load once a day.

Imagine the database:

SSN;date(yyyymmdd);info
1;20140101;A
1;20140105;B
2;20140106;A
1;20140103;C

So if my query is (SSN = "1" and date < "20140104"), I need to get:

1;20140101;A
1;20140103;C

My first approach is:

Row Key = SSN + date.
One family with many columns to store information. (info:cep, info:name, ...)

Does anyone see a performance problem with this approach? Although my key is composed using a date, I don't think it causes "monotonically increasing values", because the SSN comes first.

Answer 1

This is a perfectly fine design. For read scans you would use startKey=ssn+0 and endKey=ssn+date. You need to allocate a fixed number of characters for the user identifier field (9 for an SSN), because row keys are sorted lexicographically. 20 bln events / 420 mln total SSNs in circulation ≈ 47 events per SSN, assuming even distribution. That's not much, but I would still think about index size and any optimizations required.
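To illustrate why the fixed width matters and how the start/end keys bound the scan, here is a minimal sketch in plain Python. It does not use any HBase client; a sorted list stands in for the table, and the names `make_row_key` and `scan_before` are made up for this example. It assumes the SSN is zero-padded to 9 digits and that the end key is exclusive, so the scan returns dates strictly before the given one.

```python
from bisect import bisect_left

SSN_WIDTH = 9  # fixed width so that byte-wise ordering matches SSN ordering


def make_row_key(ssn, date):
    """Hypothetical helper: zero-padded SSN followed by a yyyymmdd date."""
    return (str(ssn).zfill(SSN_WIDTH) + date).encode("ascii")


def scan_before(sorted_keys, ssn, date):
    """Simulate a range scan: start key inclusive, end key exclusive."""
    start = make_row_key(ssn, "00000000")  # lowest possible date for this SSN
    stop = make_row_key(ssn, date)         # excluded, so only earlier dates match
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# The sample data from the question, stored in lexicographic key order.
rows = sorted(
    make_row_key(ssn, date)
    for ssn, date in [
        ("1", "20140101"),
        ("1", "20140105"),
        ("2", "20140106"),
        ("1", "20140103"),
    ]
)

result = scan_before(rows, "1", "20140104")
print(result)  # [b'00000000120140101', b'00000000120140103']
```

Without the padding, a key for SSN "1" and a key for SSN "10" would share a prefix and the range for one user could pick up rows of another.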

Events are time-series. You might be interested in the following summary. It has 3 use cases: https://cloud.google.com/bigtable/pdf/CloudBigtableTimeSeries.pdf

Answer 2

The design is good given the required read/write pattern. OpenTSDB actually uses a similar schema to store time-series data (heavy real-time metrics, sensor data, etc.).

You do have monotonically increasing keys (for a given SSN), but it does not matter for two reasons:

The leading part of the key is an SSN, and its cardinality (~450MM) is far larger than the number of nodes/region servers.
Most importantly, you write only once a day! Monotonically increasing keys might cause hotspotting depending on your data distribution and write patterns. Doing a bulk load means creating a pre-split table, generating HFiles 'offline', and loading them all at once without going through the HBase write pipeline (WAL, Memstore, Minor/Major Compactions). Write-time hotspotting cannot happen.
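As a rough illustration of what "pre-split" could look like for this key layout, the sketch below computes evenly spaced split keys over the 9-digit SSN space. This is plain Python, not HBase API code; `split_points` is a hypothetical helper, and the even spacing assumes SSNs are roughly uniformly distributed, which may not hold for real data. In practice you would derive split keys from a sample of the actual keys and supply them when creating the table.

```python
def split_points(num_regions, ssn_width=9):
    """Hypothetical helper: evenly spaced split keys over the padded SSN space.

    Returns num_regions - 1 boundaries, each a zero-padded SSN prefix.
    Assumes SSNs are spread uniformly over 0 .. 10**ssn_width - 1.
    """
    if num_regions < 2:
        return []  # a single region needs no split keys
    space = 10 ** ssn_width
    return [
        str(space * i // num_regions).zfill(ssn_width).encode("ascii")
        for i in range(1, num_regions)
    ]


splits = split_points(4)
print(splits)  # [b'250000000', b'500000000', b'750000000']
```

Because every row key starts with the padded SSN, each boundary cleanly separates users, and all events of one SSN stay in the same region.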

A minor piece of advice:

Use single-character column family names: info -> d (as in 'data')
