
Loading irregular JSON into an Elasticsearch index with a mapping using the Python client

I have some .json files where not all fields are present in all records. For example, caseclass.json looks like:

[{
    "name" : "john smith", 
    "age" : 12, 
    "cars": ["ford", "toyota"], 
    "comment": "i am happy"
},
{
    "name": "a. n. other", 
    "cars": "", 
    "comment": "i am panicking"
}]

Using Elasticsearch 7.6.1 via the Python client elasticsearch:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import json
import os
from elasticsearch_dsl import Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
class Person(Document):
        class Index:
            using = es
            name = 'person_index'
        name = Text()
        age = Integer()
        cars = Text()
        comment = Text(analyzer='snowball')   

Person.init()

with open("caseclass.json") as json_file:
    data = json.load(json_file)
for indexid in range(len(data)):
    document = Person(name=data[indexid]['name'], age=data[indexid]['age'], cars=data[indexid]['cars'], comment=data[indexid]['comment'])
    document.meta.id = indexid
    document.save()

Naturally I get KeyError: 'age' when the second record is read. My question is: is it possible to load such records into an Elasticsearch index using the Python client and a pre-defined mapping, instead of dynamic mapping? The above code works if all fields are present in all records, but is there a way to do this without checking for the presence of each field in each record? The actual records have a complex structure and there are millions of them. Thanks

The error has nothing to do with your mapping. It is just telling you that age could not be accessed in one of the records in caseclass.json: data[indexid]['age'] raises KeyError because the second record has no age key.
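The failure can be reproduced without Elasticsearch at all. Here is a minimal sketch using the second record from the question; the variable name `record` is only illustrative:

```python
# The second record from caseclass.json has no "age" key.
record = {"name": "a. n. other", "cars": "", "comment": "i am panicking"}

try:
    record["age"]  # subscript access raises for a missing key
except KeyError as err:
    print("KeyError:", err)

# dict.get returns None (or a chosen default) instead of raising.
age = record.get("age")
print(age)
```

This is plain dictionary behaviour, which is why the error appears before anything is sent to the index.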

The index mapping is created when you call Person.init(). You can verify that by calling print(es.indices.get_mapping(Person.Index.name)) right after Person.init().

I've cleaned up your code a bit:

import json
import os
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])


class Person(Document):
    class Index:
        using = es
        name = 'person_index'
    name = Text()
    age = Integer()
    cars = Text()
    comment = Text(analyzer='snowball')


Person.init()
print(es.indices.get_mapping(Person.Index.name))

with open("caseclass.json") as json_file:
    data = json.load(json_file)
    for indexid, case in enumerate(data):
        document = Person(**case)
        document.meta.id = indexid
        document.save()

Notice how I used **case to spread all key-value pairs of a case into keyword arguments instead of looking up each one by key. A key that is missing from a record is simply not passed, so no KeyError is raised.
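To see the unpacking in isolation, here is a small sketch that runs without an Elasticsearch server. `PersonStub` is a hypothetical stand-in for the `Person` document, not part of elasticsearch_dsl; it only records the keyword arguments it receives:

```python
class PersonStub:
    """Hypothetical stand-in for the Document subclass, for illustration only."""

    def __init__(self, **fields):
        # Store only what was passed; absent keys are simply never set.
        self.fields = dict(fields)


cases = [
    {"name": "john smith", "age": 12, "cars": ["ford", "toyota"], "comment": "i am happy"},
    {"name": "a. n. other", "cars": "", "comment": "i am panicking"},
]

# ** turns each dict into keyword arguments, so no key is looked up by name.
documents = [PersonStub(**case) for case in cases]
print(sorted(documents[1].fields))
```

As far as I know, `Document` accepts arbitrary keyword arguments in a similar way, so a key that is not declared on `Person` would still be sent along and handled by dynamic mapping unless the index is configured otherwise.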

The generated mapping is as follows:

{
  "person_index" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "cars" : {
          "type" : "text"
        },
        "comment" : {
          "type" : "text",
          "analyzer" : "snowball"
        },
        "name" : {
          "type" : "text"
        }
      }
    }
  }
}
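Since the question mentions millions of records, saving one document per request may be slow. This goes beyond the answer above, so treat it as a sketch under assumptions: the helper name `generate_actions` is hypothetical, and it only builds the per-record action dicts without contacting a server.

```python
def generate_actions(records, index_name):
    """Yield one bulk action per record; keys absent from a record stay absent."""
    for doc_id, record in enumerate(records):
        yield {"_index": index_name, "_id": doc_id, "_source": record}


records = [
    {"name": "john smith", "age": 12, "cars": ["ford", "toyota"], "comment": "i am happy"},
    {"name": "a. n. other", "cars": "", "comment": "i am panicking"},
]
actions = list(generate_actions(records, "person_index"))
print(actions[1])
```

Dicts of this shape can be passed to `elasticsearch.helpers.bulk(es, actions)`. The mapping shown above still applies, because it belongs to the index created by `Person.init()` rather than to the call that writes the documents.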
