
Analyzers in elasticsearch

I'm having trouble understanding the concept of analyzers in Elasticsearch with the Tire gem. I'm actually a newbie to these search concepts. Can someone here point me to a reference article or explain what analyzers actually do and why they are used?

I see different analyzers mentioned for Elasticsearch, like keyword, standard, simple, and snowball. Without knowledge of analyzers I can't make out what actually fits my need.

Let me give you a short answer.

An analyzer is used at index time and at search time. It's used to create an index of terms.

To index a phrase, it can be useful to break it into words. That's where the analyzer comes in.

It applies tokenizers and token filters. A tokenizer could be a whitespace tokenizer, which splits a phrase into tokens at each space. A lowercase tokenizer will split a phrase at each non-letter character and lowercase all the letters.

A token filter is used to filter or convert some tokens. For example, an ASCII folding filter will convert characters like ê, é, and è to e.
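
To make this concrete, here is a minimal sketch of index settings that pair a whitespace tokenizer with lowercase and ASCII folding token filters. The names my_index and my_folding_analyzer are illustrative, not from the original answer:

    # Sketch: whitespace tokenizer + lowercase + asciifolding
    # (index and analyzer names are illustrative)
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_folding_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": ["lowercase", "asciifolding"]
            }
          }
        }
      }
    }

With that analyzer, a phrase like Café Déjà would come out as the tokens cafe and deja.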

An analyzer is a mix of all of that.

You should read the Analysis guide and look through all the different options you have.

By default, Elasticsearch applies the standard analyzer. It will remove all common English words (and apply many other filters).

You can also use the Analyze API to understand how it works. Very useful.
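
For example, a request like the following (shown in the JSON body syntax of recent Elasticsearch versions) returns the tokens the standard analyzer produces:

    # Ask Elasticsearch how the standard analyzer tokenizes a phrase
    GET /_analyze
    {
      "analyzer": "standard",
      "text": "The Quick Brown Foxes."
    }

The response lists each token along with its position and character offsets, so you can see exactly what would end up in the index.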

In Lucene, an analyzer is a combination of a tokenizer (splitter) + stemmer + stopword filter.

In Elasticsearch, an analyzer is a combination of:

  1. Character filter: used to "tidy up" a string before it is tokenized, e.g. to remove HTML tags (a combined sketch follows this list).
  2. Tokenizer: used to break up the string into individual terms or tokens. An analyzer must have exactly one.
  3. Token filter: changes, adds or removes tokens. A stemmer is an example of a token filter; it is used to get the base form of a word, e.g. happy and happiness both have the same base, happi.
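
Putting the three parts together, here is a minimal sketch of a custom analyzer with an HTML-stripping character filter, a standard tokenizer, and lowercase and stopword token filters. The names my_index and my_custom_analyzer are made up for illustration:

    # Sketch: character filter + tokenizer + token filters combined
    # (index and analyzer names are illustrative)
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip"],
              "tokenizer": "standard",
              "filter": ["lowercase", "stop"]
            }
          }
        }
      }
    }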

See the Snowball demo here.

This is a sample setting:

    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "analyzerWithSnowball": {
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "englishSnowball"]
              }
            },
            "filter": {
              "englishSnowball": {
                "type": "snowball",
                "language": "english"
              }
            }
          }
        }
      }
    }
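
To actually use analyzerWithSnowball, reference it from a field's mapping. Below is a minimal sketch in current Elasticsearch mapping syntax; the index and field names (my_index, title) are assumptions, and the index is assumed to have been created with the settings above:

    # Sketch: point a text field at the custom analyzer defined above
    # (index and field names are assumptions, not from the original answer)
    PUT /my_index/_mapping
    {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "analyzerWithSnowball"
        }
      }
    }

Both indexing of title values and full-text queries against that field will then run through analyzerWithSnowball.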

Ref:

  1. Comparison of Lucene Analyzers
  2. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html

Here's an awesome plugin on a GitHub repo. It's somewhat of an extension of the Analyze API. I found it on the official Elastic plugin list.

What's great is that it shows the tokens with all their attributes after every single step. With it, it is easy to debug an analyzer configuration and see why we got certain tokens and where we lost the ones we wanted.

I wish I had found it earlier. Thanks to it, I just found out why my keyword_repeat token filter seemed not to work correctly. The problem was caused by the next token filter in the chain, icu_transform (used for transliteration), which unfortunately didn't respect the keyword attribute and transformed all of the tokens. I don't know how else I would have found the cause if not for this plugin.
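
For reference, the kind of filter chain described above would look roughly like the sketch below. This is my reconstruction, not the author's actual configuration: icu_transform comes from the analysis-icu plugin, and the transform id is only an example.

    # Reconstruction of the described chain, not the author's config;
    # requires the analysis-icu plugin, transform id is an example
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_translit_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["keyword_repeat", "my_latin_transform"]
            }
          },
          "filter": {
            "my_latin_transform": {
              "type": "icu_transform",
              "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
            }
          }
        }
      }
    }

keyword_repeat emits each token twice, with one copy flagged by the keyword attribute; a later filter that ignores that attribute, as icu_transform did here, ends up transforming both copies.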
