How to really reindex data in Elasticsearch

I have added new mappings (mainly not_analyzed versions of existing fields), and I now have to figure out how to reindex the existing data. I tried following the guide on the Elasticsearch website, but it is just too confusing. I have also tried plugins (elasticsearch-reindex, allegro/elasticsearch-reindex-tool), and I have looked at "ElasticSearch - Reindexing your data with zero downtime", which is a similar question. I was hoping not to rely on external tools (if possible) and to use the bulk API instead, as with the original insert.

I could easily rebuild the whole index, since the data is really read-only, but that won't work in the long term if I want to add more fields once I'm in production. Does anyone know of an easy-to-follow solution or set of steps for a relative novice to ES? I'm on version 2 and using Windows.

Re-indexing means reading the data, deleting the data in Elasticsearch, and ingesting the data again. There is no such thing as "changing the mapping of existing data in place." All the re-indexing tools you mentioned are just wrappers around read->delete->ingest.
You can always adjust the mapping for new indices and add fields later. All new fields will be indexed according to this mapping. Or use dynamic mapping if you are not in control of the new fields.
Have a look at "Change default mapping of string to "not analyzed" in Elasticsearch" to see how to use dynamic mapping to get not_analyzed string fields.
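
As a rough sketch of the dynamic-mapping route, the snippet below builds an index-creation body in which every newly seen string field defaults to not_analyzed. It assumes Elasticsearch 2.x mapping syntax (the `string` type and the `_default_` mapping do not exist in later major versions). The index name `data-version2`, the template name `strings_not_analyzed`, and the function name are placeholders chosen for illustration.

```python
import json


def not_analyzed_strings_mapping():
    """Build an index-creation body that maps new string fields as not_analyzed.

    Assumes Elasticsearch 2.x syntax; all names here are illustrative only.
    """
    return {
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            "mapping": {"type": "string", "index": "not_analyzed"},
                        }
                    }
                ]
            }
        }
    }


# Body to send with: PUT /data-version2   (hypothetical index name)
body = json.dumps(not_analyzed_strings_mapping(), indent=2)
```

The template only affects fields that are added after the index is created; documents already indexed elsewhere still have to be ingested again.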

Re-indexing is very expensive. A better way is to create a new index and drop the old one. To achieve this with zero downtime, have all your clients go through an index alias. Take an index called "data-version1". In steps:

  • create your index "data-version1" and give it an alias named "data"
  • only use the alias "data" in all your client applications
  • to update your mapping: create a new index (with the new mapping) called "data-version2" and put all your data in it
  • to switch from version1 to version2: drop the alias "data" on version1 and create an alias "data" on version2 (or first create, then drop). If you issue these as two separate requests, your clients will briefly see no data (or double data) in between, though the gap should be too short for them to notice. Sending both actions in a single request to the _aliases endpoint applies them atomically and avoids the gap.

It's good practice to always use aliases.
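
The alias switch in the last step can be sketched as a single `_aliases` request carrying a `remove` and an `add` action. The helper name `alias_swap_actions` is an assumption made for this example; the index and alias names are the ones used in the steps above.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Body to send with: POST /_aliases
body = json.dumps(alias_swap_actions("data", "data-version1", "data-version2"))
```

Because both actions travel in one request, clients querying "data" should never observe a state where the alias points at neither index or at both.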

As of version 2.3, a new _reindex API is available that does exactly what its name says. Basic usage is to POST a body like the following to /_reindex:

{
    "source": {
        "index": "currentIndex"
    },
    "dest": {
        "index": "newIndex"
    }
}
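
For anyone who prefers to send that request from a script rather than a console, here is a minimal sketch using only the Python standard library. The function names are made up for this example, and the URL in the usage note assumes a node listening on localhost:9200.

```python
import json
import urllib.request


def reindex_body(source_index, dest_index):
    """Build the minimal _reindex request body shown above."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `post_json("http://localhost:9200/_reindex", reindex_body("currentIndex", "newIndex"))`. Create "newIndex" with the new mapping first; otherwise Elasticsearch falls back to dynamic mapping for the destination.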

Elasticsearch reindex from remote host to local host example (Jan 2020 update)

# show indices on this host
curl 'localhost:9200/_cat/indices?v'

# edit elasticsearch configuration file to allow remote indexing
sudo vi /etc/elasticsearch/elasticsearch.yml

## copy the line below somewhere in the file
>>>
# --- whitelist for remote indexing ---
reindex.remote.whitelist: my-remote-machine.my-domain.com:9200
<<<

# restart elasticsearch service
sudo systemctl restart elasticsearch

# run reindex from remote machine to copy the index named filebeat-2016.12.01
curl -H 'Content-Type: application/json' -X POST '127.0.0.1:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://my-remote-machine.my-domain.com:9200"
    },
    "index": "filebeat-2016.12.01"
  },
  "dest": {
    "index": "filebeat-2016.12.01"
  }
}'

# verify index has been copied
curl 'localhost:9200/_cat/indices?v'
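
To make the verification step less of an eyeball check, one option is to pull `docs.count` for the index out of the `_cat/indices?v` output on both hosts and compare the two numbers. The parser below is only a sketch: the function name is hypothetical, it relies on the header row that `?v` prints, and it assumes every column has a value (rows for closed indices may not).

```python
def docs_count(cat_indices_text, index_name):
    """Return docs.count for index_name from `_cat/indices?v` output, or None if absent."""
    lines = [line for line in cat_indices_text.splitlines() if line.strip()]
    if not lines:
        return None
    # Locate columns by header name, since the column set differs between versions.
    header = lines[0].split()
    index_col = header.index("index")
    count_col = header.index("docs.count")
    for line in lines[1:]:
        cols = line.split()
        if len(cols) > max(index_col, count_col) and cols[index_col] == index_name:
            return int(cols[count_col])
    return None
```

Running it on the output saved from the remote and the local host should give the same number for filebeat-2016.12.01 once the reindex has finished and the index has refreshed.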

I faced the same problem, but I couldn't find any resource on updating the mapping and analyzer of an existing index. My suggestion is to use the scroll and scan API and reindex your data into a new index with the new mapping and new fields.
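
The ingest half of that approach can be sketched as follows: each page of hits returned by a scroll request is turned into a newline-delimited body for the `_bulk` endpoint, targeting the new index. The function name is an assumption for this example, and the `_type` field in the action line reflects Elasticsearch 2.x, where every document has a type.

```python
import json


def hits_to_bulk(hits, dest_index):
    """Convert one page of scroll hits into a newline-delimited _bulk body."""
    lines = []
    for hit in hits:
        # Action line: index the document under its original type and id.
        action = {
            "index": {"_index": dest_index, "_type": hit["_type"], "_id": hit["_id"]}
        }
        lines.append(json.dumps(action))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API requires a newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""
```

The resulting string is what gets POSTed to `/_bulk`; repeat for every scroll page until a page comes back with no hits.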

If, like me, you want a straight answer to this common and basic problem, which is poorly addressed by Elastic and the community in general, here is the code that works for me.

It assumes you are just debugging, not in a production environment, so adding or removing fields is perfectly legitimate and you don't care about downtime or latency:

# First, block writes on the index so that it can be cloned
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

# clone index into a temporary index
POST /my_index/_clone/my_index-000001  

# Re-enable writes on the original index; the reindex below cannot write to it otherwise
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": false
  }
}

# Copy all documents back into the original index to force them to be reindexed
POST /_reindex
{
  "source": {
    "index": "my_index-000001"
  },
  "dest": {
    "index": "my_index"
  }
}

# Finally, delete the temporary index
DELETE my_index-000001
