How to boost repeated values in a multiValued field in Solr

Question

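I have some repeated (identical string) data in a multiValued field of my Solr index, and I want to boost documents by the number of matches in that field. For example: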

doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }

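When I run the query q=locales:en_US, I want to see doc1 at the top, because it has two "en_US" values. What is the right way to boost this kind of data?

Should I use a special tokenizer?

Solr version: 4.5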

Answer 1

Disclaimer

To use any of the solutions below, you need to make one of the following changes:

Create a copyField for locales:

<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>

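Or change the type of locales to "text_general" (this type is provided in the standard Solr collection1 example).

First solution (ordering):

Results can be ordered by a function, so we can sort by the number of occurrences of the term in the field (the termfreq function):

If you use the copyField, the sort parameter is: termfreq(locales_text,'en_US') DESC
If locales is of type text_general, the sort parameter is: termfreq(locales,'en_US') DESC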
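
As a rough illustration of how the sort parameter fits into a full request, here is a minimal Python sketch that only builds the select URL. The host, port, and core name in SOLR_SELECT_URL and the helper name build_sorted_query are assumptions made for this example, not something from the original setup.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_sorted_query(field: str, term: str, sort_field: str) -> str:
    """Return a select URL that orders matches by termfreq, highest first."""
    freq = f"termfreq({sort_field},'{term}')"
    params = {
        "q": f"{field}:{term}",     # match on the original field
        "fl": f"*,score,{freq}",    # also return the match count per document
        "sort": f"{freq} DESC",     # order by number of occurrences
        "wt": "xml",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_sorted_query("locales", "en_US", "locales_text")
print(url)

# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["sort"][0])
```

urlencode takes care of escaping the parentheses, quotes, and spaces in the function expression, which is easy to get wrong when the URL is assembled by hand.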

Sample response for the copyField option (the result is the same for the text_general type):

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">*,score</str>
    <str name="sort">termfreq(locales_text,'en_US') DESC</str>
    <str name="indent">true</str>
    <str name="q">locales:en_US</str>
    <str name="_">1383598933337</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">0.4203996</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

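You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.

One thing to keep in mind: this is an ordering function, not a boost function. If you would rather raise the score based on multiple matches, you will probably be more interested in the second solution.

I included the score in the results to demonstrate what @arun is talking about. You can see that the scores differ (probably because of field length). Quite unexpectedly (to me), for a multivalued string field the scores are the same.

Second solution (boosting):

If you use the copyField, the query is: {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of type text_general, the query is: {!boost b=termfreq(locales,'en_US')}locales:en_US

Sample response for the copyField option (the result is the same for the text_general type):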

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="lowercaseOperators">true</str>
    <str name="fl">*,score</str>
    <str name="indent">true</str>
    <str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
    <str name="_">1383599910386</str>
    <str name="stopwords">true</str>
    <str name="wt">xml</str>
    <str name="defType">edismax</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">1.1890696</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

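You can see that the scores changed significantly. The first document now scores twice as high as the second (because it has two matches, each worth 0.5945348).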
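
The arithmetic behind these two scores can be checked with a short Python sketch. It assumes that the {!boost} parser multiplies the score of the wrapped query by the value of the function in b, and that both documents have the same base score of 0.5945348 for locales:en_US, as the numbers in the sample response suggest. The helper name boosted_score is made up for this example.

```python
import math


def boosted_score(base_score: float, term_freq: int) -> float:
    """Multiplicative boost: base query score times the termfreq value."""
    return base_score * term_freq


# Base score taken from the sample response above (assumed equal for both docs).
BASE_SCORE = 0.5945348

doc1_score = boosted_score(BASE_SCORE, 2)  # 'en_US' appears twice in doc1
doc2_score = boosted_score(BASE_SCORE, 1)  # 'en_US' appears once in doc2

print(doc1_score, doc2_score)
print(math.isclose(doc1_score, 1.1890696))  # matches maxScore in the response
```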

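Third solution (omitNorms=true)

Following @arun's answer, I think there is also a third option.

If you convert the field to (for example) text_general and set omitNorms=true for that field, it should give the same result.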

Answer 2

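The default standard request handler in Solr does not compute the score from term frequency alone. Besides term frequency, it also uses the length of the field. See the Lucene scoring algorithm, which says: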

lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.

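Since doc2 has a shorter field, it probably gets the higher score. Check the scores of the results by adding fl=*,score to the query. To understand how Solr arrives at a score, use fl=*,score&wt=xml&debugQuery=on (then right-click in the browser and view the page source to see the score calculation properly indented). I believe you will see that lengthNorm lowers the score for doc1.

To keep the length of the field from contributing to the score, you need to disable norms. Set omitNorms=true for that field (reference: http://wiki.apache.org/solr/SchemaXml), then see what the scores are.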
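
To make the effect of lengthNorm concrete, here is a minimal Python sketch of the two per-document factors in Lucene's classic TF-IDF similarity: tf = sqrt(freq) and lengthNorm = 1 / sqrt(number of terms in the field). It assumes each locale value is indexed as exactly one token, ignores the precision lost when norms are stored in a single byte, and leaves out idf and queryNorm because they are identical for both documents. The function names are chosen for this example only.

```python
import math


def tf(freq: int) -> float:
    """Term-frequency factor of the classic similarity: sqrt(freq)."""
    return math.sqrt(freq)


def length_norm(num_terms: int) -> float:
    """Length normalization: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)


# doc1: 4 values, 'en_US' occurs twice; doc2: 1 value, 'en_US' occurs once.
doc1_factor = tf(2) * length_norm(4)
doc2_factor = tf(1) * length_norm(1)

# With norms enabled, doc1 scores lower than doc2 despite the extra match.
ratio_with_norms = doc1_factor / doc2_factor

# With omitNorms=true the length factor drops out and only tf remains.
ratio_without_norms = tf(2) / tf(1)

print(ratio_with_norms, ratio_without_norms)
```

Under these assumptions doc1 ends up at roughly 0.71 times the score of doc2 with norms enabled, and roughly 1.41 times with norms omitted. The first figure is in line with the ratio 0.4203996 / 0.5945348 seen in the first sample response of Answer 1, although debugQuery=on remains the reliable way to confirm it on a real index.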
