
Solr: How can I get all documents ordered by score with a list of keywords?

I have a Solr 3.1 database containing emails with two fields:

  • datetime
  • text

For the query I have two parameters:

  • today's date
  • keyword array("important thing", "important too", "not so important, but more than average")

Is it possible to create a query to

  1. get ALL documents of this day AND
  2. sort them by relevancy, so that the email which contains most of my keywords (important things) scores best?

The part with the date is not very complicated:

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

I know that you can boost the keywords this way:

q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"

But how do I use the keywords only to sort this list and get ALL entries, instead of doing a real query and getting only a few entries back?

Thanks for the help!

You could do a first query for:

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

which gives all documents that match the range. Then, use CachingWrapperFilter for the second query to find the documents in the DocSet from the first query which have at least one keyword. They will be relevance ranked per tf-idf. You may want to use ConstantScoreQuery for the first query to get the list of matching docids in the fastest possible way.

You need to specify your terms in the main query and then change your date query to be a filter query on these results by adding the following.

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

So you should have something like this:

q=<terms go here>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
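As a rough sketch of how such a request could be assembled from the keyword list, the Python below builds the `q`, `fq`, and `rows` parameters. The function name `build_solr_params`, the example date, and the `rows` value are illustrative assumptions. Appending `*:*` as an extra optional clause is also an assumption, not part of the answer above: it is one way to keep documents that match no keyword in the result so that the keywords only influence the ordering.

```python
from urllib.parse import urlencode


def build_solr_params(keywords, day, rows=1000):
    """Build Solr request parameters (hypothetical helper).

    keywords: list of (phrase, boost) pairs
    day:      date string such as "2011-06-01"
    rows:     how many documents to return (Solr returns only 10 by default)
    """
    # One boosted phrase clause per keyword, e.g. text:"important thing"^5
    clauses = [
        'text:"{}"^{}'.format(phrase.replace('"', '\\"'), boost)
        for phrase, boost in keywords
    ]
    # Assumption: *:* matches every document, so mails without any
    # keyword still appear, only with a lower score.
    clauses.append("*:*")
    return {
        "q": " OR ".join(clauses),
        # The date restriction goes into fq, so it does not affect scoring.
        "fq": "datetime:[{0}T00:00:00.000Z TO {0}T23:59:59.999Z]".format(day),
        "rows": rows,
    }


params = build_solr_params(
    [
        ("important thing", 5),
        ("important too", 2),
        ("not so important, but more than average", 0.5),
    ],
    "2011-06-01",
)
query_string = urlencode(params)  # URL-encoded, ready to append to .../select?
```

Setting `rows` explicitly matters here because a request without it returns only the first ten matches, which can look like "only a few entries" coming back.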

Edit: A little more about filter queries (as suggested by rfreak).

From Solr Wiki - FilterQuery Guidance - "Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection."

These should be sorted by relevancy score already; that is just the default behavior of Solr. You can see the score by adding that field.

fl=*,score
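To illustrate what comes back when the score is requested, here is a small Python sketch that reads a JSON response (as returned with `wt=json`) and lists each document's score. The sample response text is made up for illustration and the helper name `scores_and_texts` is hypothetical; only the overall `response` / `docs` / `score` layout is taken from Solr's JSON output.

```python
import json

# Made-up sample in the shape of a Solr JSON response with fl=*,score
sample_response = """
{
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 1.5,
    "docs": [
      {"datetime": "2011-06-01T09:00:00Z", "text": "an important thing", "score": 1.5},
      {"datetime": "2011-06-01T10:30:00Z", "text": "lunch menu", "score": 0.2}
    ]
  }
}
"""


def scores_and_texts(raw_json):
    """Return (score, text) pairs in the order Solr delivered them."""
    docs = json.loads(raw_json)["response"]["docs"]
    return [(doc["score"], doc["text"]) for doc in docs]


for score, text in scores_and_texts(sample_response):
    print(score, text)
```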

If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation at http://<yourserver:port#>/<instancename>/admin/form.jsp, you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.

I hope that this helps you.

Sorting by relevance is the default behavior in Solr/Lucene.

If your results are unsatisfactory, try putting the keywords in quotes.

//Edit: Following the answer from Paige Cook, use something like this:

q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

//2nd update: Thinking about this answer again, quotes are not a good idea, because in this case you will only receive "important thing" mails, but no "important too" mails.

The point is which keywords you are using. Searching for important thing (without quotes) gives the highest scores to "important thing" mails, but Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords. Another idea would be to search only for "important". But the field values "important thing" and "important too" then get nearly the same score, because in both cases 50% of the field value consists of the search keyword (here: "important"). So you probably have to change your keywords. It could work after changing "important too" into "also an important mail": the ratio of the search word "important" to the field value is then best for the shortest mail description, which scores highest.
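The ratio argument can be made concrete with a toy calculation. The Python below is a deliberately simplified sketch and not Lucene's actual scoring: it keeps only the term-frequency factor (square root of the frequency) and the length normalization (one over the square root of the number of terms) used by Lucene's default similarity, and it ignores idf, boosts, query normalization, and the lossy encoding of norms. The function names are hypothetical.

```python
import math


def length_norm(num_terms):
    """Length normalization: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)


def toy_score(field_value, term):
    """Simplified score of one search term against one field value."""
    tokens = field_value.lower().split()
    if not tokens:
        return 0.0
    freq = tokens.count(term.lower())
    return math.sqrt(freq) * length_norm(len(tokens))


for value in ["important thing", "important too", "also an important mail"]:
    print(value, "->", round(toy_score(value, "important"), 3))
```

Under these assumptions the two-word values get the same score, while the four-word value scores lower, which is the effect the paragraph above relies on.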
