
Elasticsearch: Delete and Re-create Index

I just started learning Elasticsearch indices, and in my everyday work I need to update an ES index daily. Since the data changes every day, I delete the index, re-create one with the same name, and insert the new data into it. But our client sometimes reports that the data disappears for a while and only comes back after refreshing, so this seems to affect requests from the client side. Is this the right way to update an index? I know there is an update-by-query method, but I want to know whether delete-and-recreate is a bad way to update an ES index, and how it affects the results shown on the webpage rendered by the front-end developers.


So basically my code looks like this:


    if self.es.indices.exists(index=idx):
        self.es.indices.delete(index=idx)
    self.es.indices.create(index=idx, body=body)
    for line in self.course_info_df.collect():
        tmp = line.asDict()
        output = {"x": tmp['x'],
                  "y": tmp['y'],
                  "z": tmp['z']}
        self.es.index(index=idx, doc_type='doc', body=output)

This uses the Python library called elasticsearch: each time, I check whether the index exists, and if it does, I delete it and re-create a new one.

The reason I need to re-create the index is that I need to update the data for our recommendation service: I submit an offline Spark job to the cluster and store the results in the ES index. So you can assume I am using ES as a database, because the front-end developers read from it to render data on the webpage.

Also, does Spark provide any functions to upload a DataFrame directly into an ES index as ES-formatted data?
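(For reference on the Spark question: the elasticsearch-hadoop connector exposes a Spark SQL data source, `org.elasticsearch.spark.sql`, that can write a DataFrame straight into an index. A minimal sketch follows; the host, port, and index name are placeholders, and the connector JAR must be on the Spark classpath.)

```python
def es_write_options(nodes, port, resource):
    """Build the option map passed to df.write (placeholder values; adjust for your cluster)."""
    return {
        "es.nodes": nodes,          # comma-separated list of ES hosts
        "es.port": str(port),       # connector expects string-valued options
        "es.resource": resource,    # target index (and type on pre-7.x clusters)
    }

def write_df_to_es(df, nodes="localhost", port=9200, resource="products-00002"):
    # Writes the whole DataFrame into the target index via elasticsearch-hadoop.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**es_write_options(nodes, port, resource))
       .mode("append")
       .save())
```

With this, the Spark job can skip the `collect()` loop entirely and let the connector handle batching and parallel writes per partition.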

I don't understand why you reindex all documents every day, but it may be required by your business logic. Anyhow, here are some solutions for you:

Solution 1

You can use aliases for this. Say you have products and you want to reindex them every day. On the first day, create an index named products-00001, index all the data into it, and create an alias named products with the following request:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "products-00001",
        "alias": "products"
      }
    }
  ]
}

The next day, create another index in the cluster named products-00002 and index all the data into the new index first. Then switch the alias to the new index:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "products-00002",
        "alias": "products"
      }
    },
    {
      "remove": {
        "index": "products-00001",
        "alias": "products"
      }
    }
  ]
}

The request above contains an add action for products-00002 and a remove action for products-00001, so the alias name is removed from products-00001 and attached to products-00002. Both actions run atomically in a single request, so the products alias never points at an empty or half-built index — which is exactly what causes the "disappearing data" your client sees with delete-and-recreate. After this operation, you can delete the products-00001 index. The next day, create your new index as explained above, as products-00003, and so on.
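The same swap can be done from the asker's Python client. A hedged sketch, assuming the official `elasticsearch` package and illustrative index names — building the actions body in a helper keeps the add and remove in one atomic `_aliases` request:

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build the body for a single atomic alias swap: add new, remove old."""
    return {"actions": [
        {"add": {"index": new_index, "alias": alias}},
        {"remove": {"index": old_index, "alias": alias}},
    ]}

# Usage (requires a live cluster, so shown as comments):
# es.indices.update_aliases(body=alias_swap_actions("products", "products-00001", "products-00002"))
# es.indices.delete(index="products-00001")   # only delete the old index after the swap succeeds
```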

On the front-end side, they will use products as the index name; they won't need to change anything on their side.

For more information : https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#aliases

Solution 2

I think looping over all documents and reindexing them every day may be wrong — in my opinion it usually is, though depending on your business logic it can be correct. Anyhow, you can use the update operation instead. In this case you don't need an alias or anything else: you index all documents once, and after that you update documents one by one or in bulk whenever they change. You can even use partial updates. As I understand it, you already know about _update_by_query. Just be aware that these updates will increase the number of deleted documents in your indices:
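A sketch of this approach with the Python client's bulk helper — partial updates sent in one request instead of deleting and re-creating the index. The document ids and field names below are made up for illustration:

```python
def update_actions(index, docs):
    """Turn a {doc_id: partial_fields} mapping into bulk 'update' actions."""
    for doc_id, fields in docs.items():
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "doc": fields,            # partial update: only these fields change
            "doc_as_upsert": True,    # insert the document if it does not exist yet
        }

# Usage (requires a live cluster, so shown as comments):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch()
# helpers.bulk(es, update_actions("products", {"42": {"score": 0.93}}))
```

Note that stable document ids (e.g. your product ids) are what make updates possible; the original code lets ES auto-generate ids, so the same product would be indexed as a new document every day.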

GET _cat/indices?v&s=index&h=index,docs.count,docs.deleted

index                                 docs.count docs.deleted
some-index.analytics.2022.05                   1            0
some-index.other.2022.05                    3058          250

You can see the number of deleted documents with the _cat/indices request above. These deleted documents can affect your search performance. For this reason, you need to run a merge operation from time to time, or configure merging with an appropriate frequency on the index — but please read the documentation on merging before doing anything.
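For reference, a merge can be triggered explicitly with the force merge API (index name illustrative):

POST /products/_forcemerge?max_num_segments=1

Force merge is an expensive operation, so it is best run during low-traffic periods, and ideally only on indices that are no longer receiving writes.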
