Solr query/field analyzer

I am a total beginner with Solr and have a problem with unwanted characters getting into query results. For example, when I search for "foo bar" I get content containing "'foo' bar" and similar. I just want exact matches. As far as I know, this can be set up in the schema.xml file. My content field type:

<fieldtype name="textNoStem" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <filter class="solr.LowerCaseFilterFactory" />
        <tokenizer class="solr.KeywordTokenizerFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldtype>

Please let me know if you know the solution. Kind regards.

For both analyzers, the first line should be the tokenizer. The tokenizer splits the text into smaller units (words, most of the time). For your need, the WhitespaceTokenizerFactory is probably the right choice.

If you want an absolutely exact match, you do not need any filter after the tokenizer. But if you do not want searches to be case sensitive, you need to add a LowerCaseFilterFactory.
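As a minimal sketch of that ordering (not a tested configuration), an analyzer chain with the tokenizer first and the optional lowercase filter after it could look like this. Both factory classes are the ones named above; whether to keep the filter depends on whether you want case-insensitive matching.

```xml
<analyzer>
    <!-- The tokenizer must come first: it splits the text on whitespace only,
         so punctuation such as quotes stays part of the token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Optional: remove this filter if matching should be case sensitive -->
    <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
```

With this chain, the text 'foo' bar should produce the tokens 'foo' and bar, so a search for foo would no longer match the quoted form.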

Notice that you have two analyzers: one of type 'index' and the other of type 'query'. As the names imply, the first one is used when indexing content, while the other is used when you run queries. A rule that is almost always good is to use the same set of tokenizers and filters for both analyzers.
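Applied to the field type from the question, that rule might be sketched as follows. The name textNoStem and the positionIncrementGap value are taken from the question; everything else is an assumption about what you want, so adjust it to your needs.

```xml
<fieldtype name="textNoStem" class="solr.TextField" positionIncrementGap="100">
    <!-- Index-time chain: tokenizer first, then filters -->
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <!-- Query-time chain: deliberately identical to the index-time chain -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldtype>
```

When both chains are identical, a single analyzer element without a type attribute is an equivalent shorthand, since Solr then uses it for both indexing and querying.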

If you only want exact matches, use KeywordTokenizerFactory instead of StandardTokenizerFactory at query time.
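A rough sketch of that variant, assuming the goal is to match the entire field value as one unit, would use KeywordTokenizerFactory on both sides. The type name textWholeValue is hypothetical and chosen only for illustration.

```xml
<!-- Hypothetical type name; KeywordTokenizerFactory emits the whole input as a single token -->
<fieldtype name="textWholeValue" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <!-- Optional: keep for case-insensitive matching, remove for strict matching -->
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldtype>
```

Note that this only matches documents whose whole field value equals the query text. Depending on the query parser and Solr version, the query string may also be split on whitespace before the analyzer sees it, so a multi-word value may need to be sent as a quoted phrase.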

I guess you don't get any results because the data that is already indexed was tokenized differently. As Pascal said, the WhitespaceTokenizer is the right choice in your case. Use it at both index and query time, and check the results after indexing some new data, not against the previously indexed data.

I suggest using the analysis page to see the results without actually indexing. It's quite useful. Make changes in the schema, reload the core, go to the analysis page, and look at the verbose output to get a step-by-step analysis.
