Searching numeric data using Solr

I am using Solr for a (perhaps unusual) use case: providing ranked results for numeric data.

  1. Say I have a record-set of Objects O {O1...On}, and for each of those objects I have multiple measurements, e.g. Viscosity, Porosity, Permeability, etc.

  2. For an On+1 object, I need to search the above record-set to find the most "similar" objects (along the multiple dimensions of Viscosity, Porosity, Permeability, etc.).

  3. Since the record-set O holds hundreds of millions of records, it is impractical to compute a similarity metric such as Cosine or Minkowski against every record. I need to prune the result-set to a top 100 or so candidates, and I'm using Solr to run a query.
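
To make item 3 concrete: the metrics named there are cheap to evaluate once the candidate set is small. The following is a minimal Python sketch, assuming each object is a plain tuple of measurements in a fixed order. The function names (`minkowski`, `cosine_similarity`, `rerank`) and the sample values are illustrative assumptions, not part of Solr.

```python
import math

def minkowski(a, b, p=2.0):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(query, candidates, p=2.0, top_k=100):
    """Sort (doc_id, vector) pairs by distance to the query, closest first."""
    ranked = sorted(candidates, key=lambda item: minkowski(query, item[1], p))
    return ranked[:top_k]

# Hypothetical measurements: (viscosity, porosity, permeability)
query = (1.2, 10.0, 5.0)
candidates = [
    ("O1", (1.1, 9.8, 6.0)),
    ("O2", (1.2, 10.1, 5.0)),
    ("O3", (3.0, 12.0, 2.0)),
]
best = rerank(query, candidates, top_k=2)
print([doc_id for doc_id, _ in best])
```

If the measurements live on very different scales, they would probably need normalizing before either metric is meaningful.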

I run a range query using the parameters of the On+1 object, e.g. Porosity in [9.5 TO 10.5] (so +/-5% of the value), and chain the range clauses together in a Boolean query to get a ranked list of matches.
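
A minimal sketch of how such a query string could be assembled, in Python. The helper name `build_range_query`, the field names, and the choice of `OR` as the default operator are assumptions for illustration; the post does not say which Boolean operator is used.

```python
def build_range_query(measurements, tolerance=0.05, operator="OR"):
    """Build one Solr range clause per field, each spanning value +/- tolerance."""
    clauses = []
    for field, value in measurements.items():
        delta = abs(value) * tolerance
        # :g keeps the bounds compact, e.g. 9.5 rather than 9.500000
        clauses.append(f"{field}:[{value - delta:g} TO {value + delta:g}]")
    return f" {operator} ".join(clauses)

# Hypothetical field names and values for the On+1 object
query = build_range_query({"porosity": 10.0, "viscosity": 2.0})
print(query)  # porosity:[9.5 TO 10.5] OR viscosity:[1.9 TO 2.1]
```

With `OR`, a document's score grows with the number of ranges it falls into, which would be consistent with the step-like scores described below.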

My questions:

  1. Is there a better way of doing this and obtaining a score from Solr that I could use, perhaps as a threshold? The score from the current range query method seems to follow a step function and is unhelpful.

  2. Could I persist the numbers in a text_general format and search using the query numbers? Since the query strings could run very long, I am unsure how to approach this. Perhaps using MLT?

Any ideas, or suggestions for other toolkits to help with the above?

Theory

As you said, the range query won't work here for scoring, but it's still a good way to filter the initial index.

Once the index is filtered (or not) with some base query, we can apply custom scoring.

Here's a general example of how to implement custom scoring: http://spykem.blogspot.com/2013/06/plug-in-external-score-to-solr.html


When implementing custom scoring, the CustomScoreProvider can receive the following parameters:

  • Value step - the distance increment at which the score is lowered
  • Score step - the amount by which the score is lowered whenever a "value step" occurs
  • Max additional score - a "perfect match" gets this score in addition to its native score (from the regular search query); non-perfect matches get a lower (non-negative) value

Starting from "Max additional score", the additional score is lowered by "Score step" each time the distance between the field value and the query value grows by "Value step", until it reaches zero.

The additional scoring formula will look something like this (until it reaches zero):

Max additional score - ((|fieldValue - queryValue| / Value Step ) * Score Step)
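
The formula above, written out as a small Python function. This is only a sketch: the function and parameter names are illustrative, and it takes the formula literally (the number of steps is not rounded down to a whole number), which is an assumption about the intended behaviour.

```python
def additional_score(field_value, query_value, value_step, score_step, max_additional):
    """Extra score for one document, clamped so it never goes below zero."""
    steps = abs(field_value - query_value) / value_step
    return max(0.0, max_additional - steps * score_step)

# One unit away from the query value, with the settings used in the example below
print(round(additional_score(6, 5, value_step=0.1, score_step=0.01, max_additional=1.0), 6))
```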

Example

So, for example, with the following settings:

  • Value step = 0.1
  • Score step = 0.01
  • Max additional score = 1

with the following index values for some field (e.g. permeability):

  • 3 (for doc1)
  • 5 (for doc2)
  • 6 (for doc3)
  • 7 (for doc4)
  • 99999999 (for doc5)

and if the initial search query looks like this:

q={!nearestParser valueStep=0.1 scoreStep=0.01 maxStep=1}permeability:5

Then the results will look like this (assuming the initial score is the same, 1, for all docs):

  • doc2 (score 2.0)
  • doc3 (score 1.9)
  • doc1 (score 1.8)
  • doc4 (score 1.8)
  • doc5 (score 1)
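
These numbers can be checked with a short Python simulation. It is not Solr code; the constant and function names are illustrative, and the base score of 1 for every document is the assumption stated above.

```python
VALUE_STEP, SCORE_STEP, MAX_ADDITIONAL = 0.1, 0.01, 1.0
BASE_SCORE = 1.0   # assumed identical native score for every doc
QUERY_VALUE = 5

permeability = {"doc1": 3, "doc2": 5, "doc3": 6, "doc4": 7, "doc5": 99999999}

def total_score(field_value):
    """Native score plus the distance-based additional score (never below zero)."""
    penalty = abs(field_value - QUERY_VALUE) / VALUE_STEP * SCORE_STEP
    # Rounding only hides floating-point noise in the printed result
    return round(BASE_SCORE + max(0.0, MAX_ADDITIONAL - penalty), 6)

ranking = sorted(
    ((doc, total_score(value)) for doc, value in permeability.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc, score in ranking:
    print(doc, score)
```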

Conclusion:

  • Doc2 will have the best score, as it is a perfect match.
  • Doc3 will be second, as it is the closest non-perfect match to the preferred input (and within score distance).
  • Doc1 and doc4 will have the same score, as they are both the same distance from the search value.
  • Doc5 will keep only the initial score, as it is too far out of range to be considered "similar".

I will try to come up with a practical example, but as it will take some time, I thought it would be better to answer with the idea for now.

Other possible solution

After reading about NumericRangeQuery I also had an idea about using the Trie* field structure (specifically, leveraging its ability to handle numeric range searches efficiently) in order to find the nearest value in the index, but I haven't figured out how to do it yet.

This could potentially be much more performant, though much more complicated, and there's still a chance that the Trie* structure cannot handle this sort of operation.
