I have roughly 20 billion events. An event consists of one key (an SSN), one date, and information about the event. There are 5 types of events.
Read pattern: I need to get all events from a single key less than a specific date.
Write pattern: Just a single bulk load once a day.
Imagine the database:
SSN;date(yyyymmdd);info
1;20140101;A
1;20140105;B
2;20140106;A
1;20140103;C
So if my query is (SSN = "1" and date < "20140104"), I need to get:
1;20140101;A
1;20140103;C
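To make the query concrete, here is a minimal sketch in plain Python (no database client) of how a row key of SSN + date answers it: keys of the form `<ssn>;<yyyymmdd>` sort lexicographically, so the query becomes a range scan. The key format and helper name are illustrative, not from any specific library.

```python
# Sample events: (ssn, date, info)
events = [
    ("1", "20140101", "A"),
    ("1", "20140105", "B"),
    ("2", "20140106", "A"),
    ("1", "20140103", "C"),
]

# Build sorted row keys, as an HBase/Bigtable region would store them.
table = sorted((f"{ssn};{date}", info) for ssn, date, info in events)

def scan_before(ssn, date):
    """Return all events for `ssn` strictly before `date` via a key-range scan."""
    start, stop = f"{ssn};", f"{ssn};{date}"
    return [(key, info) for key, info in table if start <= key < stop]

print(scan_before("1", "20140104"))
# → [('1;20140101', 'A'), ('1;20140103', 'C')]
```

In a real store the list comprehension would be a server-side scan between startKey and endKey, so only the matching rows are read.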
My first approach is a row key composed of SSN followed by date.
Does anyone see a performance problem with this approach? Although my keys are composed using a date, I don't think that causes "monotonically increasing values", because the SSN comes first.
This is a perfectly fine design. For read scans you would use startKey = ssn + 0 (the SSN followed by the lowest possible date) and endKey = ssn + date. You would need to allocate a fixed number of characters for the user identifier field (an SSN is 9 digits). Row keys are sorted lexicographically. 20 billion events / roughly 420 million SSNs in circulation ≈ 47 events per SSN, assuming an even distribution. That's not much, but I would still think about index size and any optimizations required.
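The fixed-width encoding and the scan bounds described above can be sketched as follows; the helper names are hypothetical, and the key layout (9-digit zero-padded SSN followed by a yyyymmdd date) is the one the answer assumes.

```python
def row_key(ssn: int, date: str = "") -> str:
    # Zero-pad the SSN to a fixed 9 digits so every key has the same
    # field width; lexicographic key order then matches (ssn, date) order.
    return f"{ssn:09d}{date}"

def scan_bounds(ssn: int, before_date: str) -> tuple[str, str]:
    """startKey/endKey for 'all events of `ssn` before `before_date`'."""
    return row_key(ssn), row_key(ssn, before_date)

print(scan_bounds(1, "20140104"))
# → ('000000001', '00000000120140104')
```

Without the padding, SSN 12 would sort between SSN 1's rows (e.g. "120140101" < "1220140101" < "120140105"), so the fixed width is what makes the range scan correct.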
Events are time-series data. You might be interested in the following summary; it covers 3 use cases: https://cloud.google.com/bigtable/pdf/CloudBigtableTimeSeries.pdf
The design is good given the required read/write pattern. OpenTSDB actually uses a similar schema to store time-series data (for heavy realtime metrics, sensor data, etc.).
You do have monotonically increasing keys (for a given SSN), but it does not matter for two reasons: the SSN prefix spreads writes across the keyspace, and your only write is a single daily bulk load, so there is no realtime write hotspot.
A minor piece of advice: