
I18n search and filtering in Elasticsearch

tl;dr

How do I match and filter a localized search against a localized index?

Long version

I have an application where a user's search must be done in the context of their language.

In the Elasticsearch index, I want documents with both i18n properties and non-i18n properties (I want to avoid creating multiple indices, one for each language).

The mapping of the document should look like this:

"entry": {
  "properties": {
    "name": { "type": "string" },   /* unlocalized property */
    "categories": {                 /* localized property */
      "properties": {
        "lang_fr": { "type": "string" },
        "lang_de": { "type": "string" }
      }
    }
  }
}

Given that, I have two requirements:

1) Matching: when searching, exclude the localized fields that do not match the user's language (say the user's language is 'fr'; I want to exclude the 'de' fields from the search). How can I do this without specifying the entire list of fields I want to search on? To start simple, I tried this, but it doesn't work:

{
  "query": {
    "match": {
      "*.lang_fr": "full_text"
    }
  }
}

However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule, like you can have in Solr.

2) Filtering: when I retrieve my results, I want to filter out all localized fields that don't correspond to my user's language. In other words, using the source filter, I'd like to get all unlocalized fields, exclude all fields starting with "lang_", but include all "lang_fr" fields. I tried the following, but it doesn't work:

{
  "_source": {
    "include": [ "*", "*.lang_fr" ],
    "exclude": [ "*.lang_*" ]
  },
  ...
}

The wildcard operator doesn't seem to work. I partially get what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields; I want a generic rule. The include/exclude combination doesn't behave the way I would like. The only thing that actually works is a query where I explicitly list every language to exclude for every field, such as:

{
  "_source": {
    "exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
                 "another_field.lang_de", "another_field.lang_en", "another_field.lang_it" ]
  },
  ...
}

for an 'fr' search.

I'm quite surprised I couldn't find anything on Google. I see this as a very standard case of i18n applied to Elasticsearch. Maybe I'm modeling i18n the wrong way in ES?

Thank you in advance!

You can achieve the first one using a query_string query, which takes advantage of the Lucene query syntax and allows wildcards in field names (the * is backslash-escaped inside the query string, and the backslash is doubled for JSON):

{
  "query": {
    "query_string": {
      "query": "\\*.lang_fr:full_text"
    }
  }
}

Or you can specify the field name pattern in the fields parameter, like this:

{
  "query": {
    "query_string": {
      "query": "full_text",
      "fields": ["*.lang_fr"]
    }
  }
}
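Since only the language code changes between users, the fields form above is easy to generate. Here is a minimal Python sketch, assuming the lang_<code> naming convention from the question; the helper name build_localized_query is hypothetical, and the code only builds the request body without contacting a cluster:

```python
import json


def build_localized_query(text, lang):
    """Build a query_string body restricted to one language's fields.

    Hypothetical helper for illustration only. Assumes localized
    sub-fields are named lang_<code>, as in the question's mapping.
    """
    return {
        "query": {
            "query_string": {
                "query": text,
                # Wildcard field pattern: any field ending in .lang_<code>
                "fields": [f"*.lang_{lang}"],
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent for a French-speaking user.
    print(json.dumps(build_localized_query("full_text", "fr"), indent=2))
```

The resulting dictionary can be passed as the search body to whichever client you use.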

As for the second one, source filtering is indeed the way to go, but I suggest simply excluding every language except the one you're searching in. For instance, if the search is in French, you exclude all other languages. You don't have to enumerate every field, only the languages you don't want (a much shorter list). That lets you add localized fields as you go without having to change the query.

{
  "_source": {
    "exclude": [ "*.lang_de", "*.lang_it" ]
  },
  ...
}
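The exclude list above can be derived from the set of languages the application supports. A minimal Python sketch, assuming a hypothetical helper build_source_filter and an application-maintained list of language codes; the key name "exclude" follows the snippet above, so check the source filtering documentation for the spelling your Elasticsearch version expects:

```python
def build_source_filter(user_lang, supported_langs):
    """Build a _source filter that drops every language but the user's.

    Hypothetical helper for illustration only. supported_langs is an
    assumed application-level list of language codes; fields are assumed
    to follow the lang_<code> naming convention from the question.
    """
    return {
        "exclude": [
            f"*.lang_{lang}"
            for lang in supported_langs
            if lang != user_lang  # keep the user's own language in _source
        ]
    }


if __name__ == "__main__":
    # Example: a French search in an application supporting fr, de and it.
    print({"_source": build_source_filter("fr", ["fr", "de", "it"])})
```

Adding a new language then means updating one list, while adding a new localized field requires no change at all.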
