Searching a numeric value from a (n)varchar column

I am implementing full-text search using Solr, and I would appreciate some help with a problem I am facing.

My schema.xml looks as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="products" version="1.2">
    <types>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
            <fieldType name="concatenated" class="solr.TextField" positionIncrementGap="100" >
                <analyzer>
                <tokenizer class="solr.LowerCaseTokenizerFactory"/>
                <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    splitOnCaseChange="0"
                    splitOnNumerics="1"
                            catenateWords="1"
                            catenateNumbers="1"
                            catenateAll="1"
                            preserveOriginal="1"
                    />
                </analyzer>
            </fieldType>
    </types>
    <fields>
        <field name="keyid" type="long" indexed="true" stored="false" required="true"/>
        <field name="combined" type="concatenated" indexed="true" stored="false"/>
    </fields>
    <uniqueKey>keyid</uniqueKey>
    <defaultSearchField>combined</defaultSearchField> 
    <copyField source="keyid" dest="keyid"/>  
    <solrQueryParser defaultOperator="OR"/>
</schema>

And my data-config.xml file looks as follows:

<dataConfig>
    <document name="products">
        <entity name="product" query="SELECT ProductId AS keyid, CONVERT(VARCHAR(18), ProductId) + ' ' + ProductName AS combined FROM Products">
            <field column="keyid" name="keyid"/>
            <field column="combined" name="combined"/>
        </entity>
    </document>
</dataConfig>

And I have a record like the following in my Products table:

ProductId|ProductName

239289231|Windows 7

Assuming a successful setup and indexing (using localhost:8089/solr/dataimport?command=full-import), why would I not get results when I run this query:

Scenario 1: localhost:8089/solr/select?q=combined:239289231

Yet the queries below do give me results (one searching the combined field and another searching the keyid field):

Scenario 2: localhost:8089/solr/select?q=combined:Windows

Scenario 3: localhost:8089/solr/select?q=keyid:239289231

Is the problem the TokenizerFactory or the FilterFactory that I am using here? Shouldn't Solr treat ProductId as a string after it is cast to VARCHAR and concatenated, and hence make it possible to search for it the way I am doing in Scenario 1?

Yes, the issue here is the tokenizer. Your tokenizer, LowerCaseTokenizerFactory, splits the text at non-letter characters and discards them, so it strips the numbers completely. That is why you cannot find any documents by searching for your ProductId values. In your example case, it is only indexing the word Windows.

I am assuming you want to lowercase the values, so you would want to use StandardTokenizerFactory as your tokenizer and LowerCaseFilterFactory as a filter to do the lowercasing. That will keep the ProductId value as a token to be indexed, so NGrams are built from the following tokens: 239289231, Windows and 7.

Here is a suggested modified fieldType:

  <fieldType name="concatenated" class="solr.TextField" positionIncrementGap="100" >
     <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" 
            maxGramSize="15" side="front"/>
        <filter class="solr.WordDelimiterFilterFactory"
             splitOnCaseChange="0"
             splitOnNumerics="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"
             />
      </analyzer>
   </fieldType>
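
To make the difference concrete, here is a rough Python sketch that imitates the two analysis chains on the sample record. It is not Solr code: the function names are made up for illustration, the regular expressions only approximate the real tokenizers on simple ASCII input like this, and the WordDelimiterFilterFactory step is left out.

```python
import re

def lowercase_tokenizer(text):
    """Approximates solr.LowerCaseTokenizerFactory: keep only runs of
    letters (digits and other non-letters are dropped), lowercased."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def standard_then_lowercase(text):
    """Approximates StandardTokenizerFactory followed by
    LowerCaseFilterFactory for simple whitespace-separated input."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def edge_ngrams(token, min_size=1, max_size=15):
    """Front edge n-grams, as EdgeNGramFilterFactory with side="front"."""
    longest = min(max_size, len(token))
    return [token[:n] for n in range(min_size, longest + 1)]

# The value produced by the SQL query for the sample row.
value = "239289231 Windows 7"

old_tokens = lowercase_tokenizer(value)      # original chain
new_tokens = standard_then_lowercase(value)  # suggested chain

# Terms that would end up in the index after edge n-gramming.
old_terms = {g for t in old_tokens for g in edge_ngrams(t)}
new_terms = {g for t in new_tokens for g in edge_ngrams(t)}

print(old_tokens)
print(new_tokens)
print("239289231" in old_terms, "239289231" in new_terms)
```

With the original chain the number never becomes a token, so there is nothing for combined:239289231 to match. With the suggested chain the full number and its prefixes are among the indexed terms.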

Also, I would recommend reviewing the Analyzers, Tokenizers and Token Filters page on the Solr Wiki for examples of how the various ones work, if you have not already. In this case, I believe it was just a mix-up between a tokenizer and a filter.
