
Indexing/Searching “complex” JSON in elasticsearch

I have some JSON that looks like the following. Let's call that field metadata:

{
  "somekey1": "val1",
  "someotherkey2": "val2",
  "more_data": {
    "contains_more": [
      {
        "foo": "val5",
        "bar": "val6"
      },
      {
        "foo": "val66",
        "baz": "val44"
      }
    ],
    "even_more": {
      "foz": 1234
    }
  }
}

This is just a simple example. The real one can grow even more complex. Keys can come up multiple times, and so can values, which can be int or str.

Now the first problem is that I'm not quite sure how to index this correctly in elasticsearch so that I can find things with specific requests.

I am using Django/Haystack, where the index looks like this:

from haystack import indexes


class FooIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    metadata = indexes.CharField(model_attr='get_metadata')
    # and some more specific fields

And the template:

{
    "foo": {{ object.foo }},
    "metadata": {{ object.metadata}},
    # and some more
}

The metadata will then be filled with the sample above, and the result will look like this:

  {
    "foo": "someValue",
    "metadata": {
      "somekey1": "val1",
      "someotherkey2": "val2",
      "more_data": {
        "contains_more": [
          {
            "foo": "val5",
            "bar": "val6"
          },
          {
            "foo": "val66",
            "baz": "val44"
          }
        ],
        "even_more": {
          "foz": 1234
        }
      }
    }
  }

This will go into the 'text' field in elasticsearch.

So the goal is now to be able to search for things like:

  • foo: val5
  • foz: 12*
  • bar: val*
  • somekey1: val1
  • and so on

The second problem: when I search, e.g., for foo: val5, it matches all objects that merely have the key "foo" and all objects that have val5 somewhere else in their structure.

This is how I search in Django:

self.searchqueryset.auto_query(self.cleaned_data['q'])

Sometimes the results are "okayish", sometimes they are just completely useless.

I could use a pointer in the right direction and would like to know which mistakes I made here. Thank you!

Edit: I added my final solution as an answer below!

The one thing that is certain is that you first need to craft a custom mapping based on your specific data and your query needs. My advice is that contains_more should be of nested type, so that you can issue more precise queries on its fields.

I don't know the exact names of your fields, but based on what you showed, one possible mapping could be something like this:

{
  "your_type_name": {
    "properties": {
      "foo": {
        "type": "string"
      },
      "metadata": {
        "type": "object",
        "properties": {
          "some_key": {
            "type": "string"
          },
          "someotherkey2": {
            "type": "string"
          },
          "more_data": {
            "type": "object",
            "properties": {
              "contains_more": {
                "type": "nested",
                "properties": {
                  "foo": {
                    "type": "string"
                  },
                  "bar": {
                    "type": "string"
                  },
                  "baz": {
                    "type": "string"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
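
With contains_more mapped as nested, a query can require that a key and its value match inside the same array element. The following is only a minimal sketch of such a request body, built as a plain Python dict. The helper name nested_match_query is hypothetical, and the field path assumes the mapping shown above.

```python
import json


def nested_match_query(path, field, value):
    """Build a search body matching `field` against `value` inside one nested object.

    `nested_match_query` is a hypothetical helper, not part of Haystack
    or any Elasticsearch client.
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    # Fields inside a nested query are addressed by full dotted path.
                    "match": {f"{path}.{field}": value}
                },
            }
        }
    }


body = nested_match_query("metadata.more_data.contains_more", "foo", "val5")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a search request by whatever client you use.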

Then, as already mentioned by mark in his comment, auto_query won't cut it, mainly because of the multiple nesting levels. As far as I know, Django/Haystack doesn't support nested queries out of the box, but you can extend Haystack to support them. This blog post explains how to tackle this: http://www.stamkracht.com/extending-haystacks-elasticsearch-backend (not sure if it helps, but you should give it a try and let us know if you need more help).

Indexing:

First of all, you should use dynamic templates if you want to define a specific mapping based on key names, or if your documents do not all have the same structure.

But 30 keys isn't that many, and you should prefer defining your own mapping over letting Elasticsearch guess it for you (if incorrect data is indexed first, the mapping will be derived from that data).
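
As a rough sketch of what a dynamic template could look like, the mapping below tells Elasticsearch to index every newly seen string field under metadata as a not_analyzed string, so that values are matched exactly instead of being tokenized. The type name your_type_name, the template name metadata_strings and the choice of not_analyzed are assumptions for illustration only. The "string"/"not_analyzed" syntax belongs to the older Elasticsearch versions that the other mappings on this page are written for.

```python
import json

# Hypothetical mapping fragment; adjust the type name and rules to your data.
mapping = {
    "your_type_name": {
        "dynamic_templates": [
            {
                "metadata_strings": {
                    # Apply to fields whose full dotted path starts with "metadata."
                    "path_match": "metadata.*",
                    # ...and whose detected JSON type is a string.
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "not_analyzed"},
                }
            }
        ]
    }
}

print(json.dumps(mapping, indent=2))
```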

Searching:

You can't search for

foz: val5

since the "foz" key doesn't exist.

But the key "metadata.more_data.even_more.foz" does: all your keys are flattened from the root of your document.

This way, you'll have to search for

foo: val5
metadata.more_data.even_more.foz: 12*
metadata.more_data.contains_more.bar: val*
metadata.somekey1: val1
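
To see where these dotted names come from, here is a small Python sketch that flattens the sample document the way plain object fields end up being addressed. It is a simplification for illustration (the function flatten is hypothetical and is not how Elasticsearch is implemented), and it treats contains_more as a regular object array rather than a nested type.

```python
def flatten(value, prefix=""):
    """Return {dotted_path: [values]} for a nested dict/list structure."""
    fields = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            for sub_path, sub_values in flatten(child, path).items():
                fields.setdefault(sub_path, []).extend(sub_values)
    elif isinstance(value, list):
        # Arrays add no path segment: their items share the parent's path.
        for item in value:
            for sub_path, sub_values in flatten(item, prefix).items():
                fields.setdefault(sub_path, []).extend(sub_values)
    else:
        fields.setdefault(prefix, []).append(value)
    return fields


doc = {
    "foo": "someValue",
    "metadata": {
        "somekey1": "val1",
        "someotherkey2": "val2",
        "more_data": {
            "contains_more": [
                {"foo": "val5", "bar": "val6"},
                {"foo": "val66", "baz": "val44"},
            ],
            "even_more": {"foz": 1234},
        },
    },
}

flat = flatten(doc)
for path in sorted(flat):
    print(path, flat[path])
```

Note that the values of both array elements end up under the same path, which is why a plain object mapping cannot tell which foo belonged to which bar.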

Using query_string, for example:

"query_string": {
    "default_field": "metadata.more_data.even_more.foz",
    "query": "12*"
}

Or, if you want to search in multiple fields:

"query_string": {
    "fields" : ["metadata.more_data.contains_more.bar", "metadata.somekey1"],
    "query": "val*"
}
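
Both fragments above are only the inner clause of a search request; in a full request body they sit under a top-level "query" key. A minimal sketch that builds the complete body in Python follows. The helper name query_string_body is hypothetical.

```python
import json


def query_string_body(query, fields=None, default_field=None):
    """Wrap a query_string clause in a complete search request body."""
    clause = {"query": query}
    if fields:
        # Search several fields at once.
        clause["fields"] = list(fields)
    elif default_field:
        # Search a single field.
        clause["default_field"] = default_field
    return {"query": {"query_string": clause}}


single = query_string_body("12*", default_field="metadata.more_data.even_more.foz")
multi = query_string_body(
    "val*",
    fields=["metadata.more_data.contains_more.bar", "metadata.somekey1"],
)

print(json.dumps(single, indent=2))
print(json.dumps(multi, indent=2))
```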

It took a while to figure out the right solution that works for me.

It was a mix of both answers provided by @juliendangers and @Val, plus some more customizing.

  1. I replaced Haystack with the more specific django-simple-elasticsearch
  2. Added a custom get_type_mapping method to the model

     @classmethod
     def get_type_mapping(cls):
         return {
             "properties": {
                 "somekey": {
                     "type": "<specific_type>",
                     "format": "<specific_format>",
                 },
                 "more_data": {
                     "type": "nested",
                     "include_in_parent": True,
                     "properties": {
                         "even_more": {
                             "type": "nested",
                             "include_in_parent": True,
                         }
                         # and so on for each level you care about
                     }
                 }
             }
         }

  3. Added a custom get_document method to the model

     @classmethod
     def get_document(cls, obj):
         return {
             'somekey': obj.somekey,
             'more_data': obj.more_data,
             # and so on
         }

  4. Added a custom Searchform

     class Searchform(ElasticsearchForm):
         q = forms.CharField(required=False)

         def get_index(self):
             return 'your_index'

         def get_type(self):
             return 'your_model'

         def prepare_query(self):
             # Fall back to matching everything when no query was entered.
             if not self.cleaned_data['q']:
                 q = "*"
             else:
                 q = str(self.cleaned_data['q'])
             return {
                 "query": {
                     "query_string": {
                         "query": q
                     }
                 }
             }

         def search(self):
             esp = ElasticsearchProcessor(self.es)
             esp.add_search(self.prepare_query, page=1, page_size=25,
                            index=self.get_index(), doc_type=self.get_type())
             responses = esp.search()
             return responses[0]

So this is what worked for me and covers my use cases. Maybe it can be of some help to someone.
