
Apache Solr performance issue for multiple token filters at index and query time

I have to convert digits from one language's numerals to another's in Apache Solr 6.6.2. I found that the pattern replace filter can do this job, so I added a new field type to the Solr schema with the following filters:

<fieldType name="text_use" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>

    <filter class="solr.PatternReplaceFilterFactory" pattern="0" replacement="۰"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1" replacement="۱"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="2" replacement="۲"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="3" replacement="۳"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="4" replacement="۴"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="5" replacement="۵"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="6" replacement="۶"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="7" replacement="۷"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="8" replacement="۸"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="9" replacement="۹"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>

    <filter class="solr.PatternReplaceFilterFactory" pattern="0" replacement="۰"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1" replacement="۱"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="2" replacement="۲"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="3" replacement="۳"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="4" replacement="۴"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="5" replacement="۵"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="6" replacement="۶"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="7" replacement="۷"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="8" replacement="۸"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="9" replacement="۹"/>
  </analyzer>
</fieldType>

Is applying this many filters at query and index time a good idea? Will the large number of filters cause performance problems? Finally, is it possible to write a single filter with a regex pattern instead? If so, what would it look like?

Performance: try it both with and without the filters and measure. Indexing usually has no hard performance requirement, so a few extra milliseconds is usually not a big issue. For queries, only the query text is processed, and that is far less content than the documents themselves.

I don't think there is an easier way to do what you want with the PatternReplaceFilter, since you need a specific replacement for each digit.

Writing your own filter would probably be the easiest way: an Urdu numeric conversion filter, which would probably be useful to other people as well (so upload it to a GitHub repo). In a separate filter you can perform all the replacements in a single pass, and you can do it without regex support. The performance difference might not be large, but it should at least be faster than invoking the regex engine ten times. Again, test it yourself.
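As a rough sketch of the core of such a filter, the conversion can be done in one pass over a character buffer with plain arithmetic. This relies on the replacement digits in the question being the Extended Arabic-Indic digits U+06F0 to U+06F9, which are contiguous in Unicode. The class and method names here are hypothetical, and the Lucene wiring is left out: in a real TokenFilter the same loop would presumably run over the term buffer inside incrementToken(), with a small factory class so it can be referenced from the schema.

```java
// Hypothetical helper, not part of Solr or Lucene: converts ASCII digits
// to Extended Arabic-Indic digits (U+06F0..U+06F9) in a single pass.
public final class UrduDigitConverter {

    // Distance between ASCII '0' (U+0030) and U+06F0; the ten target
    // digits are contiguous, so one offset covers all of them.
    private static final int OFFSET = '\u06F0' - '0';

    private UrduDigitConverter() {
        // Static utility class, not meant to be instantiated.
    }

    /**
     * Replaces ASCII digits in buffer[0..length) in place.
     * Characters outside '0'..'9' are left untouched.
     */
    public static void convertInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (c >= '0' && c <= '9') {
                buffer[i] = (char) (c + OFFSET);
            }
        }
    }

    /** Convenience wrapper that returns a converted copy of the input. */
    public static String convert(String input) {
        char[] chars = input.toCharArray();
        convertInPlace(chars, chars.length);
        return new String(chars);
    }
}
```

Calling UrduDigitConverter.convert on a token rewrites every ASCII digit and leaves everything else as it is. Because the length never changes, the in-place variant needs no extra allocation per token, which is the main reason to expect it to be cheaper than ten regex passes. That expectation is still worth measuring against the existing filter chain.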
