
EdgeNGram with Tire and ElasticSearch

If I have two strings:

  • Doe, Joe
  • Doe, Jonathan

I want to implement a search such that:

  • "Doe" > "Doe, Joe", "Doe, Jonathan"
  • "Doe J" > "Doe, Joe", "Doe, Jonathan"
  • "Jon Doe" > "Doe, Jonathan"
  • "Jona Do" > "Doe, Jonathan"

Here's the code that I have:

settings analysis: {
    filter: {
      nameNGram: {
        type: "edgeNGram",
        min_gram: 1,
        max_gram: 20,
      }
    },
    tokenizer: {
      non_word: {
        type: "pattern",
        pattern: "[^\\w]+"
      }
    },
    analyzer: {
      name_analyzer: {
        type: "custom",
        tokenizer: "non_word",
        filter: ["lowercase", "nameNGram"]
      },
    }
  } do
  mapping do
    indexes :name, type: "multi_field", fields: {
      analyzed:   { type: "string", index: :analyzed, index_analyzer: "name_analyzer" }, # for indexing
      unanalyzed: { type: "string", index: :not_analyzed, include_in_all: false } # for sorting
    }
  end
end

def self.search(params)
  tire.search(:page => params[:page], :per_page => 20) do
    query do
      string "name.analyzed:" + params[:query], default_operator: "AND"
    end
    sort do
      by "name.unanalyzed", "asc"
    end
  end
end

Unfortunately, this doesn't appear to be working... The tokenizing looks great: for "Doe, Jonathan" I get something like "d", "do", "doe", "j", "jo", "jon", "jona", etc., but if I search for "do AND jo", I get back nothing. If I, however, search for "jona", I get back "Doe, Jonathan." What am I doing wrong?
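To see why both partial terms exist in the index, here is a rough plain-Ruby sketch of what the analyzer chain above produces (this is not Tire or ElasticSearch code; `edge_ngrams` is a hypothetical helper that mimics the `edgeNGram` filter):

```ruby
# Hypothetical helper, not part of Tire or ElasticSearch: mimics what
# the edgeNGram filter (min_gram 1, max_gram 20) emits for one token.
def edge_ngrams(token, min_gram = 1, max_gram = 20)
  (min_gram..[max_gram, token.length].min).map { |n| token[0, n] }
end

# The non_word pattern tokenizer splits on runs of non-word characters,
# the lowercase filter runs next, then the n-gram filter expands tokens.
tokens = "Doe, Jonathan".split(/[^\w]+/).map(&:downcase)
grams  = tokens.flat_map { |t| edge_ngrams(t) }
# grams includes "d", "do", "doe", "j", "jo", "jon", "jona", ...
```

So the index side does contain both "do" and "jo"; the problem lies in how the query terms are analyzed at search time, as the answer below discusses.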

You should likely only be using EdgeNGram if you want to create an autocomplete. I suspect that you want to use a pattern filter to separate words by commas.

Something like this:

"tokenizer": {
    "comma_pattern_token": {
        "type": "pattern",
        "pattern": ",",
        "group": -1
    }
}
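With "group": -1, the pattern tokenizer uses the pattern as a split point rather than as the token itself. Roughly, in plain Ruby (an approximation, not the actual tokenizer):

```ruby
# Plain-Ruby approximation of the comma_pattern_token tokenizer above:
# "group": -1 means each comma match marks a split point.
tokens = "Doe, Jonathan".split(",")
# => ["Doe", " Jonathan"]
```

Note the leading space on " Jonathan": in a real analyzer chain you would still want a lowercase filter (and possibly a trim filter, or a pattern like ",\s*") to normalize the resulting tokens.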

If I am mistaken and you need edgeNGrams for some other reason, then your problem is that your index analyzer is ignoring stop words (such as the word AND) and your search analyzer is not. You need to create a custom analyzer for your search_analyzer that does not include the stop word filter.
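A sketch of what such a separate search-side analyzer could look like (the name name_search_analyzer is illustrative, not from the original): it reuses the same non_word tokenizer and lowercase filter, but applies no further filtering — neither a stop word filter nor the nameNGram filter — so search terms are matched whole against the indexed n-grams.

```json
{
  "analysis": {
    "analyzer": {
      "name_analyzer": {
        "type": "custom",
        "tokenizer": "non_word",
        "filter": ["lowercase", "nameNGram"]
      },
      "name_search_analyzer": {
        "type": "custom",
        "tokenizer": "non_word",
        "filter": ["lowercase"]
      }
    }
  }
}
```

In the Tire mapping this would then be wired up alongside the existing index_analyzer, e.g. index_analyzer: "name_analyzer", search_analyzer: "name_search_analyzer" on the analyzed field.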
