
Solr schema design: fitting time-series data

I am trying to fit the following data into Solr to support flexible queries, and I would like your input. I have data about users, for example:

contentID (assume uuid), 
platform (eg. website, mobile etc), 
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)
....

and a few more such fields. This data is partially pre-aggregated (by Hadoop jobs). So, for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2 and regionId = ANY", I have data in this format:

timestamp  pre-aggregated data [ uniques, total]
 Jan 15    [ 12, 4]
 Jan 14    [ 4, 3]
 Jan 13    [ 8, 7]
 ...        ...

I also have less granular data, for example for "contentID = uuid123 and platform = mobile and softwareVersion = ANY and regionId = ANY". These values are larger than in the table above because the granularity is reduced:

timestamp : pre-aggregated data [uniques, total]
 Jan 15    [ 100, 40]
 Jan 14    [ 45, 30]
 ...           ...

I'll get queries like: for "contentID = uuid123 and platform = mobile", give the sum of 'uniques' for Jan 13 to Jan 15; or for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2", give the sum of 'total' for Jan 01 to Jan 15.
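To make the intended query semantics concrete, here is a minimal Python sketch over the pre-aggregated rows from the two tables above. It is not Solr code; the function name `sum_metric`, the `row` helper, and the assumed year 2017 are illustrative only. Any dimension the caller leaves out is looked up under its "ANY" rollup.

```python
from datetime import date

DIMENSIONS = ("platform", "softwareVersion", "regionId")


def row(software_version, day, uniques, total):
    """Build one pre-aggregated row for contentID uuid123 on mobile, all regions."""
    return {
        "contentID": "uuid123",
        "platform": "mobile",
        "softwareVersion": software_version,
        "regionId": "ANY",
        "day": date(2017, 1, day),  # year is an assumption
        "uniques": uniques,
        "total": total,
    }


# Only the rows actually listed in the two tables above.
ROWS = [
    row("sw1.2", 15, 12, 4),
    row("sw1.2", 14, 4, 3),
    row("sw1.2", 13, 8, 7),
    row("ANY", 15, 100, 40),
    row("ANY", 14, 45, 30),
]


def sum_metric(rows, metric, content_id, start, end, **filters):
    """Sum `metric` over [start, end] for one exact dimension combination."""
    # Dimensions not named in the query fall back to the "ANY" rollup row.
    wanted = {dim: filters.get(dim, "ANY") for dim in DIMENSIONS}
    return sum(
        r[metric]
        for r in rows
        if r["contentID"] == content_id
        and start <= r["day"] <= end
        and all(r[dim] == value for dim, value in wanted.items())
    )


# Sum of 'total' for sw1.2 on mobile, Jan 13 to Jan 15: 4 + 3 + 7
print(sum_metric(ROWS, "total", "uuid123", date(2017, 1, 13), date(2017, 1, 15),
                 platform="mobile", softwareVersion="sw1.2"))  # 14
```

Every query is an exact match on all dimensions plus a date range, which is why each combination needs its own set of rows in this design.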

I was thinking of a simple schema where documents look like this (first example above):

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}

Second example from above:

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}

Possible optimization:

{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12" : {
      "unique": 12,
      "total": 4
  },
  "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
      "unique": 100,
      "total": 40
  },
  "ts" : "2017-01-15T01:01:21Z"
}

Challenges: The number of such rows is very large and grows exponentially with every new field. For instance, with the schema suggested above, I'll end up storing a new document for each combination of contentID, platform, softwareVersion, and regionId. If we add another field to the document, the number of combinations increases exponentially. I already have more than a billion such combination rows.
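A rough back-of-the-envelope sketch of that growth, in Python. The cardinalities below are hypothetical, chosen only for illustration. Each dimension contributes its distinct values plus one extra "ANY" rollup value, so the number of combinations per contentID per day is the product of (cardinality + 1) over all dimensions.

```python
from math import prod


def rollup_combinations(cardinalities):
    """Count dimension combinations when every dimension also has an ANY value."""
    return prod(count + 1 for count in cardinalities.values())


# Hypothetical cardinalities, not taken from the real data set.
dims = {"platform": 3, "softwareVersion": 10, "regionId": 50}
print(rollup_combinations(dims))  # 2244 = 4 * 11 * 51

# Adding one more dimension multiplies the count by (cardinality + 1).
dims["deviceType"] = 5
print(rollup_combinations(dims))  # 13464 = 2244 * 6
```

The total document count is then this number multiplied by the number of contentIDs and the number of days retained.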

I am hoping for expert advice on whether:

  1. Multiple such fields can fit in the same document for different 'ts' values, such that range queries are still possible on them.
  2. The time range (ts) can fit in the same document as a list(?) to reduce the number of rows. I know multivalued fields don't support complex data types, but perhaps something else can be done with the data/schema to reduce query time and the number of rows.

The number of these rows is very large, certainly more than 1 billion (if we go with the schema I was suggesting). What schema would you suggest that fits these query requirements?

FYI: All queries will be exact matches on fields (no partial or tokenized matching), so no analysis on fields is necessary. Almost all queries are range queries.

You are trying to store query-time results for every possible combination of attribute values. That's just too much duplicate data. Instead, store each observation and its attributes as a single data point, just once. Then, if you have 'n' observations and you add an additional attribute, the data grows additively, not exponentially. And if you need data for a certain combination of attributes, you filter/aggregate at query time.

{
  "contentID": "uuid12349789",
  "ts" : "2017-01-15T01:01:21Z",
  "observation": 10001,

  "attr-platform" : "mobile",
  "attr-softwareVersion": "sw1.2",
  "attr-regionId": "US"
}
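As a sketch of what "filter/aggregate at query time" could look like against documents of this shape, the Python snippet below builds request parameters for a Solr select handler using filter queries and a JSON Facet `sum()` aggregation. It only constructs the parameters and sends nothing. The function name `build_sum_query` and the facet label `metric_sum` are made up for illustration, and it assumes `observation` is a numeric field that is meaningful to sum and that the attribute fields are untokenized string fields.

```python
import json
from urllib.parse import urlencode


def build_sum_query(content_id, start, end, attrs, metric="observation"):
    """Build Solr params that sum `metric` for one contentID, time range and attribute set."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),  # only the aggregate is needed, not the documents
        # The term query parser does exact matching without query-syntax escaping.
        ("fq", "{!term f=contentID}" + content_id),
        # Range with inclusive start and exclusive end.
        ("fq", "ts:[" + start + " TO " + end + "}"),
    ]
    # One filter per requested attribute; omitted attributes simply match everything,
    # so no separate "ANY" documents are required.
    for name, value in sorted(attrs.items()):
        params.append(("fq", "{!term f=attr-" + name + "}" + value))
    params.append(("json.facet", json.dumps({"metric_sum": "sum(" + metric + ")"})))
    return params


params = build_sum_query(
    "uuid12349789",
    "2017-01-13T00:00:00Z",
    "2017-01-16T00:00:00Z",
    {"platform": "mobile", "softwareVersion": "sw1.2"},
)
# Query string to append to a select URL (host and collection are deployment-specific).
print(urlencode(params))
```

Leaving an attribute out of `attrs` takes the place of the pre-computed "ANY" rows: the same documents serve both the coarse and the fine-grained queries.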
