
Change field format for CSV column using bulk API (Elasticsearch/Kibana)

I want to change the type of one of the columns of my .csv file, which I import via the bulk API into Elasticsearch in Python. The column contains dates but is imported as a string (however, when I upload the file manually in Kibana, it is recognized as a date).

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
with open('user.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='user', doc_type='my-type')

I already tried a mapping, but it doesn't work:

mapping = {
  "mappings": {
    "my-type": {
      "properties": {
        "('affiliation',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('banned',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('bracket',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('country',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('created',)": {
          "type": "date",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('email',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('hidden',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('id',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('name',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('oauth_id',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('password',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('promotion',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('school',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('secret',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('speciality',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('type',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('verified',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "('website',)": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
es.indices.create(index='user', ignore=400, body=mapping)
with open('user.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='user', doc_type='csv')

Do you have any ideas or solutions? Thanks a lot!

The doc types need to be consistent in order for the correct mapping to be applied. Compare your first and second calls:

helpers.bulk(es, reader, index='user', doc_type='my-type')

helpers.bulk(es, reader, index='user', doc_type='csv')

If your mapping configures 'my-type', reference it as such in all subsequent function calls.
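One way to guarantee this, sketched below with the index and field names from the question, is to define the type name once and build both the mapping and the bulk call from that single constant, so the two can never drift apart:

import csv
from elasticsearch import Elasticsearch, helpers

DOC_TYPE = 'my-type'  # single source of truth for the type name

# Build the mapping from the same constant used in the bulk call.
mapping = {"mappings": {DOC_TYPE: {"properties": {"created": {"type": "date"}}}}}

es = Elasticsearch()
es.indices.create(index='user', ignore=400, body=mapping)

with open('user.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='user', doc_type=DOC_TYPE)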

But more importantly, reading from a CSV doesn't guarantee any original column types -- most values will be read in as strings. As such, it's recommended to pre-process your docs' attributes to guarantee they'll be treated correctly -- i.e., dates, numbers, booleans, etc.

In the function generateBulkPayload below, you can parse/modify select values right before they're inserted into ES:

import csv
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch()

index_name = "user"
doc_type = "my-type"

mapping = {
    "mappings": {
        "my-type": {
            "properties": {
                "created": {
                    "type": "date",
                    "format": "epoch_millis"  # assuming you're dealing with millisecond timestamps
                }
            }
        }
    }
}

es.indices.create(index=index_name, ignore=400, body=mapping)


def generateBulkPayload(csv_reader):
    for row in csv_reader:
        # handle your parsing here

        # overwriting the `created` attribute
        row.update(dict(created=int(row.get('created'))))

        yield row


with open('user.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es,
                 generateBulkPayload(reader),
                 index=index_name,
                 doc_type=doc_type)

This code runs without errors, but the date format is still not recognized by Elasticsearch. What should I do so that Elasticsearch recognizes it?

from dateutil import parser  # needed for parser.parse below


def generateBulkPayload(csv_reader):
    for row in csv_reader:
        created = row.get("('created',)")  # Base format : 2021-03-04 13:56:16.663801
        dt = parser.parse(created)  # parse the string into a datetime object
        epoch = dt.timestamp()  # seconds since the epoch, as a float

        row.update(dict(created=int(epoch)))

        yield row
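One detail worth checking, assuming the epoch_millis mapping above is still in place: timestamp() returns seconds as a float, so int(epoch) yields epoch seconds, which Elasticsearch would then misread as milliseconds. A minimal sketch of the adjustment, using a hypothetical helper toEpochMillis:

from dateutil import parser

def toEpochMillis(value):
    # parser.parse handles '2021-03-04 13:56:16.663801'; timestamp() returns
    # seconds as a float, so multiply by 1000 to get epoch_millis.
    return int(parser.parse(value).timestamp() * 1000)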
