
Why does Solr not index some segmented words?

I'm trying to index some Chinese documents with Solr, but it looks like Solr doesn't index some segmented words.

The analyzer I use is the IK analyzer: http://code.google.com/p/ik-analyzer/ .

The field to be indexed:

 <field name="hospital_alias_splitted" type="cn_ik" indexed="true" stored="true" multiValued="true" omitNorms="false"/>

cn_ik definition:

<fieldType name="cn_ik" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" useSmart="false"/>
    </analyzer>
</fieldType>

For example, suppose the word to be indexed is "AB" (without quotes). After word segmentation with the Chinese analyzer, I get 3 tokens: "AB", "A", and "B".

As we can see, the first token "AB" covers the following two tokens.

After feeding these tokens to Solr, it looks like Solr only indexes "AB"; "A" and "B" are ignored, because searching for "A" or "B" doesn't return any results.

My guess is that when Solr indexes "AB", it has already reached the end of the indexed word, so "A" and "B" are ignored.
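For reference, overlapping tokens like these are a normal part of Lucene's index model: each token carries a position increment, and a covering token such as "AB" does not prevent the overlapping "A" and "B" from being indexed and searchable. The following is a minimal sketch in plain Python (not Solr internals) of how a token stream like the one above would land in an inverted index; the token stream values mirror the "AB"/"A"/"B" example, with increments chosen as a typical analyzer would emit them.

```python
from collections import defaultdict

# (term, position_increment) as an analyzer might emit them:
# "AB" advances to position 0, "A" overlaps it (increment 0),
# "B" advances to position 1. These increments are an assumption
# based on how overlapping tokens are usually emitted, not IK's
# actual output.
token_stream = [("AB", 1), ("A", 0), ("B", 1)]

def build_index(tokens):
    """Build a toy inverted index: term -> list of positions."""
    index = defaultdict(list)
    position = -1
    for term, increment in tokens:
        position += increment
        index[term].append(position)
    return index

index = build_index(token_stream)
print(dict(index))  # {'AB': [0], 'A': [0], 'B': [1]}

# A term query is just a lookup: every emitted token is searchable,
# so "A" and "B" should match even though "AB" covers them.
assert "A" in index and "B" in index
```

If a search for "A" or "B" finds nothing, the likely culprit is not the overlap itself but the analysis chain (e.g. tokens never reaching the index, or query-time analysis producing different tokens than index time).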

Using Luke and the Analysis Request Handler doesn't give me any more hints. I'm not sure whether this is a bug or a feature of Solr.

Any comments or suggestions?

Thanks :)

(As I am not able to comment on the question, I am typing here)

I would recommend trying it with different analyzers. Since you didn't tell us which analyzer you're using, I assume you are using a default one such as CJK.

As far as I know, there are other analyzers for Chinese and for similar languages that don't put spaces between words. They might also help you.

It would be really nice to see the part of your schema for that field, though...

Edit: you can also check this link.

