python elasticsearch bulk index datatype

I am using the following code to create an index and load data into Elasticsearch:

import csv

from elasticsearch import helpers, Elasticsearch

es = Elasticsearch('localhost:9200')
index_name = 'wordcloud_data'
with open('./csv-data/' + index_name + '.csv') as f:
    # DictReader yields one dict per CSV row; every value is a string
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print("done")

My CSV data is as follows:

date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
2017-06-17,agency imposed,16
2017-06-17,customer appreciation,11

The data loads fine, but the datatype is not accurate. How do I force word_count to be mapped as an integer and not text? Note that it figures out the date type on its own. Is there a way for it to figure out the int datatype automatically, or by passing some parameter?

Also, what do I do to increase ignore_above, or remove it for some of the fields, if I want to? Basically, no limit on the number of characters?

{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

You need to create a mapping that describes the field types.

With the elasticsearch-py client this can be done using the es.indices.put_mapping or es.indices.create methods, by passing a JSON document that describes the mappings, as shown in this SO answer. It would be something like this:

es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {  
            "date": {"type":"date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
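One caveat on ordering: put_mapping can add fields, but it cannot change the type of a field that already exists, so an index that was already loaded with word_count as text would need to be deleted and recreated before the data is loaded again. The sketch below builds a create-index body with explicit types and passes it to es.indices.create. It assumes a cluster that still uses mapping types (as in the question's output); build_index_body, create_index, and the keyword_limit parameter are hypothetical names for illustration. As far as I know, the 256 limit in the question comes from dynamic mapping, so an explicitly mapped keyword sub-field only gets an ignore_above if you set one.

```python
import json

INDEX_NAME = "wordcloud_data"   # index name from the question
DOC_TYPE = "my-type"            # mapping type from the question (pre-7.x clusters)


def build_index_body(doc_type=DOC_TYPE, keyword_limit=None):
    """Return a create-index body with explicit field types.

    keyword_limit is a hypothetical knob for this sketch: None leaves
    ignore_above out of the keyword sub-field, an int sets it.
    """
    keyword_field = {"type": "keyword"}
    if keyword_limit is not None:
        keyword_field["ignore_above"] = keyword_limit

    return {
        "mappings": {
            doc_type: {
                "properties": {
                    "date": {"type": "date"},
                    "word_count": {"type": "integer"},
                    "word_data": {
                        "type": "text",
                        "fields": {"keyword": keyword_field},
                    },
                }
            }
        }
    }


def create_index(es, index_name=INDEX_NAME, **kwargs):
    """Create the index on an existing elasticsearch-py client `es`."""
    return es.indices.create(index=index_name, body=build_index_body(**kwargs))


if __name__ == "__main__":
    # Print the body only; no Elasticsearch server is contacted here.
    print(json.dumps(build_index_body(keyword_limit=1024), indent=2))
```

Calling create_index(es) before helpers.bulk should make the bulk load use these types. String values such as "11" from the CSV are normally coerced into the integer field.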

However, I'd suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API for describing mappings. It would be something along these lines (untested):

import csv

from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections
from elasticsearch.helpers import bulk

connections.create_connection(hosts=["localhost"])

index_name = "wordcloud_data"

class WordCloud(DocType):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"
        doc_type = "my_type"   # If you need it to be called so

# Create the index and its mapping before loading any data
WordCloud.init()
with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        # to_dict(True) includes the metadata (such as _index) that bulk needs
        (WordCloud(**row).to_dict(True) for row in reader)
    )

Please note that I haven't tried the code I've posted; I just wrote it. I don't have an ES server at hand to test. There could be some small mistakes or typos (please point them out if there are), but the general idea should be correct.
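On the question of whether the type can be picked up automatically: csv.DictReader yields every value as a string, so without a mapping Elasticsearch sees "11" rather than 11. A minimal sketch of converting the values before indexing follows. The helper name typed_rows and its int_fields parameter are made up for this example, and the sample rows are read from memory rather than from the CSV file.

```python
import csv
import io


def typed_rows(reader, int_fields=("word_count",)):
    """Yield CSV rows with the named columns converted from str to int."""
    for row in reader:
        doc = dict(row)
        for name in int_fields:
            doc[name] = int(doc[name])
        yield doc


# Sample rows from the question, read from memory instead of a file.
sample = io.StringIO(
    "date,word_data,word_count\n"
    "2017-06-17,luxury vehicle,11\n"
    "2017-06-17,signifies acceptance,17\n"
)
rows = list(typed_rows(csv.DictReader(sample)))
print(rows[0])
```

Passing typed_rows(reader) to helpers.bulk in place of reader should then send numbers rather than strings. With no explicit mapping, dynamic mapping would be expected to choose long rather than integer for such a field, so an explicit mapping is still the way to get exactly integer.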

Thanks. @drdaeman's solution worked for me. One thing worth mentioning, though, for elasticsearch-dsl 6+:

class Meta:
     index = "wordcloud_data"
     doc_type = "my-type"

This snippet will raise a "cannot write to wildcard index" exception. Change it to the following:

class Index:
   name = 'wordcloud_data'
   doc_type = 'my_type'
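Whichever approach is used, it may be worth confirming the result by fetching the mapping again, for example with es.indices.get_mapping(index="wordcloud_data"). The sketch below assumes the response has the same shape as the mapping shown in the question and flattens it to a field-to-type dict. The function name field_types is hypothetical, and the sample response is abbreviated from the question's output instead of coming from a live server.

```python
def field_types(mapping_response, index_name, doc_type):
    """Return {field name: type} from a get-mapping style response dict."""
    props = mapping_response[index_name]["mappings"][doc_type]["properties"]
    return {name: spec.get("type") for name, spec in props.items()}


# Abbreviated copy of the mapping from the question (before the fix).
sample_response = {
    "wordcloud_data": {
        "mappings": {
            "my-type": {
                "properties": {
                    "date": {"type": "date"},
                    "word_count": {"type": "text"},
                    "word_data": {"type": "text"},
                }
            }
        }
    }
}

types = field_types(sample_response, "wordcloud_data", "my-type")
print(types)
```

After recreating the index with an explicit mapping, word_count would be expected to show up as integer here instead of text.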
