solr dismax phrase search

I'm building an application that uses Solr to match longer queries (typically complete sentences) against indexed documents that are almost always shorter (search terms). For example, my query looks like "should I buy a house now while the rates are low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt" and my indexed documents look like "buy a house" or "house loan rates".

I thought that the right way to do this would be to use shingles, the dismax parser, and a highly boosted "pf" field. So I have a "normal" text field, kw_stopped (text_en in Solr 3.4), with a very aggressive stopword list, and a kw_phrases field that is meant to hold the phrase shingles. Its definition looks like this:

<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
  </analyzer>
</fieldType>

and my schema fields look like this:

<field name="kw_stopped" type="text_en" indexed="true" omitNorms="True" />
<!-- keywords almost as is - to provide truer match for full phrases -->
<field name="kw_phrases" type="shingle" indexed="true" omitNorms="True" />

My search handler config is this:

<requestHandler name="edismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="fl">
      keywords
    </str>
    <str name="mm">1</str>
    <str name="qf">
      kw_stopped^1.0 kw_phrases^5.0
    </str>
    <str name="pf">
      kw_phrases^50.0
    </str>
    <int name="ps">3</int>
    <int name="qs">3</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

When I turn on debugQuery, I notice that "kw_phrases" is never matched unless the query and the document are exactly the same. The parsedquery also shows that each token from the query appears as its own DisjunctionMaxQuery clause for "kw_stopped", but all the shingles are put into one giant clause for the kw_phrases field.

Where is the gap in my understanding? How can I make this work?

thanks! Vijay

If you are using long sentences to search against shorter documents, you seem to be on the right track:

  • Use the edismax query parser.
  • Set mm to a very low value or 0%, so that the behavior is the same as OR, i.e. any of the words may match. You can raise it to require at least 2 or 3 matching words, which prevents documents that match only a single word from being returned. This lets you control how many of the terms in the search string must match for a document to be returned.
  • Use pf (phrase fields) to rank documents that contain exact phrase matches higher.
  • Instead of the explicit shingle filter, use pf2 and pf3 (shingled phrase fields) to rank documents higher when they match two- or three-word combinations from the query.
  • Use ps (phrase slop) to allow an adequate amount of slop in phrase matches.
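
The suggestions above could be sketched as a handler configuration like the following. It reuses the handler name and the kw_stopped field from the question. The boost values and the mm setting of 2 are illustrative assumptions rather than tuned recommendations, and it is worth checking that pf2 and pf3 are supported by the edismax parser in your Solr version.

```xml
<!-- Sketch only: boosts and mm are assumed values, not tuned ones. -->
<requestHandler name="edismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <!-- Require at least 2 query terms to match; use 1 or 0% for pure OR behavior. -->
    <str name="mm">2</str>
    <!-- Term matching against the plain (non-shingled) field. -->
    <str name="qf">kw_stopped^1.0</str>
    <!-- Boost documents that match the whole query as a phrase. -->
    <str name="pf">kw_stopped^50.0</str>
    <!-- Boost documents that match two-word and three-word shingles of the query. -->
    <str name="pf2">kw_stopped^10.0</str>
    <str name="pf3">kw_stopped^20.0</str>
    <!-- Slop allowed in the phrase matches. -->
    <int name="ps">3</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```

With this layout the shingling happens at query time inside the parser, so the separate kw_phrases field and its ShingleFilterFactory would no longer be needed for phrase boosting.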

You would certainly also need a good stopword list, applied at both index and query time, to prevent matches on general terms.
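
As a rough illustration of filtering stopwords on both sides, a field type might look like the sketch below. The type name text_stopped and the file name stopwords.txt are hypothetical. A single analyzer element without a type attribute is applied at both index and query time.

```xml
<!-- Sketch only: "text_stopped" and "stopwords.txt" are assumed names. -->
<fieldType name="text_stopped" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index time and at query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop general terms listed in the stopword file. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```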

 