
Solr schema design: fitting time-series data

I am trying to fit the following data into Solr to support flexible queries, and would like to get your input on it. I have data about users, say:

contentID (assume uuid),
platform (e.g. website, mobile, etc.),
softwareVersion (e.g. sw1.1, sw2.5, etc.),
regionId (e.g. us144, uk123, etc.)
....

and a few more such fields. This data is partially pre-aggregated (by Hadoop jobs). So let's assume that for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this format:

timestamp  pre-aggregated data [ uniques, total]
 Jan 15    [ 12, 4]
 Jan 14    [ 4, 3]
 Jan 13    [ 8, 7]
 ...        ...

And then I also have less granular data, say for "contentID = uuid123 and platform = mobile and softwareVersion = ANY and regionId = ANY" (these values will be larger than in the table above, since the granularity is reduced):

timestamp  pre-aggregated data [uniques, total]
 Jan 15    [ 100, 40]
 Jan 14    [ 45, 30]
 ...           ...

I'll get queries like "contentID = uuid123 and platform = mobile", give the sum of 'uniques' for Jan 15 - Jan 13; or "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2", give the sum of 'total' for Jan 15 - Jan 01.

I was thinking of a simple schema where documents would look like this (for the first example above):

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}

And for the second example above:

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}
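
With documents like these, the example queries become a date-range filter plus a sum aggregation over the matching rollup rows (the ones where every unspecified field is 'ANY'). A minimal sketch using Solr's JSON Request API with a JSON Facet aggregation; the collection name 'metrics' and the exact date bounds are just placeholders:

# "contentID = uuid123 and platform = mobile": sum of 'unique' over Jan 13 - Jan 15
curl -H 'Content-type:application/json' http://localhost:8983/solr/metrics/query -d '
{
  "query": "contentID:uuid123",
  "filter": [
    "platform:mobile",
    "softwareVersion:ANY",
    "regionId:ANY",
    "ts:[2017-01-13T00:00:00Z TO 2017-01-15T23:59:59Z]"
  ],
  "limit": 0,
  "facet": { "uniques_sum": "sum(unique)", "total_sum": "sum(total)" }
}'

For the second example query, the filters would change to softwareVersion:sw1.2 (keeping regionId:ANY) with the wider date range, and the facet would sum 'total'.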

Possible optimization:

{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12" : {
      "unique": 12,
      "total": 4
  },
 "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
      "unique": 100,
      "total": 40
  },
  "ts" : "2017-01-15T01:01:21Z"
}

Challenges: The number of such rows is very large, and it grows exponentially with every new field. For instance, if I go with the schema suggested above, I'll end up storing a new document for each combination of contentID, platform, softwareVersion, and regionId. If we throw another field into the document, the number of combinations increases exponentially. I already have more than a billion such combination rows.

I am hoping for advice from experts on whether:

  1. Multiple such fields can fit in the same document for different 'ts' values, such that range queries are still possible on them.
  2. The time range (ts) can be fit into the same document as a list(?), to reduce the number of rows. I know multi-valued fields don't support complex data types, but perhaps something else can be done with the data/schema to reduce query time and the number of rows.

The number of these rows is very large, certainly more than 1 billion (if we go with the schema I was suggesting). What schema would you suggest that would fit these query requirements?

FYI: all queries will be exact matches on fields (nothing partial or tokenized), so no analysis is necessary on any field. And almost all queries involve a time range on ts.
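
For reference, a field setup matching that requirement might look like the sketch below: plain string fields (no analysis) plus docValues to support the sums, defined via the Schema API against a hypothetical 'metrics' collection, using the default configset's string/pdate/plong types:

curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/metrics/schema -d '
{
  "add-field": [
    {"name": "contentID",       "type": "string", "docValues": true},
    {"name": "platform",        "type": "string", "docValues": true},
    {"name": "softwareVersion", "type": "string", "docValues": true},
    {"name": "regionId",        "type": "string", "docValues": true},
    {"name": "ts",              "type": "pdate",  "docValues": true},
    {"name": "unique",          "type": "plong",  "docValues": true},
    {"name": "total",           "type": "plong",  "docValues": true}
  ]
}'

(If your Solr version does not accept a list under add-field, the same definitions can be sent as one add-field command per field.)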

You are trying to store query-time results for every possible combination of attribute values. That's just too much duplicate data. Instead, store each observation and its attributes as a single data point, just once. That way, if you had 'n' observations and you add an additional attribute, the data grows additively, not exponentially. And if you need data for a certain combination of attributes, you filter/aggregate at query time.

{
  "contentID": "uuid12349789",
  "ts" : "2017-01-15T01:01:21Z",
  "observation": 10001,

  "attr-platform" : "mobile",
  "attr-softwareVersion": "sw1.2",
  "attr-regionId": "US",
}
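
With that layout, a query for a particular combination of attributes becomes filters plus a query-time aggregation over the matching raw documents. A rough sketch in the same JSON Request API style as above (the collection name and date bounds are placeholders):

# "contentID = uuid12349789 and platform = mobile and softwareVersion = sw1.2":
# sum of observations over Jan 13 - Jan 15, aggregated at query time
curl -H 'Content-type:application/json' http://localhost:8983/solr/metrics/query -d '
{
  "query": "contentID:uuid12349789",
  "filter": [
    "attr-platform:mobile",
    "attr-softwareVersion:sw1.2",
    "ts:[2017-01-13T00:00:00Z TO 2017-01-15T23:59:59Z]"
  ],
  "limit": 0,
  "facet": { "observation_sum": "sum(observation)" }
}'

If per-user uniques are needed, the raw documents would also have to carry a user identifier, which the unique() or hll() JSON Facet aggregations could then count at query time.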
