
Efficiency of updating documents in Elasticsearch

Posted 2014-01-17 11:20:00. Tags: java, lucene, nosql, elasticsearch

I have been using Elasticsearch successfully for a year or so, loading millions of documents and running various queries and facets against the data.

Some of my users recently asked whether it is possible to 'mark documents as read' so that those documents can be excluded from search results.

I have implemented this without issue, but now I'm wondering whether I have chosen the best implementation. My understanding is that updating a document in ES (or any Lucene implementation) is in effect the same as deleting and re-indexing it.
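To make that delete-and-reindex model concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not how Lucene or Elasticsearch is actually implemented: `ToyIndex`, its attributes, and the sample document are all hypothetical names invented for this example. The point it tries to show is that flipping one field still writes a full new copy of the document and leaves the old copy behind, flagged as deleted.

```python
class ToyIndex:
    """Toy model of an append-only index (hypothetical, for illustration only)."""

    def __init__(self):
        self.slots = []       # append-only list of (doc_id, source) entries
        self.deleted = set()  # slot positions flagged as deleted
        self.live = {}        # doc_id -> slot position of the current version

    def index(self, doc_id, source):
        # If the id already exists, flag the old copy as deleted first.
        old = self.live.get(doc_id)
        if old is not None:
            self.deleted.add(old)
        # Then append a complete new copy; nothing is modified in place.
        self.slots.append((doc_id, dict(source)))
        self.live[doc_id] = len(self.slots) - 1

    def update(self, doc_id, partial):
        # A partial update still rewrites the whole document.
        _, current = self.slots[self.live[doc_id]]
        self.index(doc_id, {**current, **partial})

    def get(self, doc_id):
        return self.slots[self.live[doc_id]][1]


idx = ToyIndex()
idx.index("doc-1", {"title": "Quarterly report", "read": False})
idx.update("doc-1", {"read": True})

print(len(idx.slots))    # entries written for one logical document
print(len(idx.deleted))  # stale copies left behind
print(idx.get("doc-1"))
```

In this toy model, marking one document as read leaves two stored entries and one stale copy, which is the kind of overhead the question is asking about.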

My question to the Lucene/ES community... Will there be any negative impact from updating documents as a user-driven, ad hoc task? (And can you suggest an alternative?)

Thanks, JayTee

1 Answer

Yes, there will be a performance overhead for re-indexing. This is described as "non-negligible" at https://www.elastic.co/blog/managing-relations-inside-elasticsearch (there it's talking about nested docs, but updating a field on a normal doc, i.e. one without nested fields, is the same):

"If your data changes often, nested documents can have a non-negligible overhead associated with reindexing."

An alternative is given later in that article, namely Parent/Child:

"Parent/Child removes this limitation by separating the two documents and only loosely coupling them... means you are more free to update/delete children docs, since they have no effect on the parent or other children.

The downside is ...(queries).. aren't quite as fast .. since they are not colocated in the same Lucene block."

So if every doc you have will eventually be updated to "read", that will involve the overhead of re-indexing your entire datastore. If that's going to happen slowly over time, maybe your architecture can handle it.

If you are concerned that a high number of docs could be marked as read, and that this will put a large load on your system, you can use a parent/child relationship for the read field. But there will be extra overhead (minor, as I understand it) to run the query "only give docs where the child field 'read' is false".
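As a rough sketch of what that query could look like, the snippet below builds a search request body with a `has_child` clause wrapped in a `bool` query. It only constructs the JSON, it does not talk to a cluster. The child type name `read_status`, the helper `unread_docs_query`, and the sample `match` clause on `title` are assumptions made up for this example; only the field name `read` comes from the discussion above. The mapping needed to set up the parent/child relationship differs between Elasticsearch versions and is not shown.

```python
import json


def unread_docs_query(child_type="read_status", user_query=None):
    """Build a search body matching parents that have a child with read == false.

    child_type is a hypothetical child document type holding the 'read' flag.
    """
    has_unread_child = {
        "has_child": {
            "type": child_type,
            "query": {"term": {"read": False}},
        }
    }
    must = [has_unread_child]
    if user_query is not None:
        # Combine the user's own search with the unread restriction.
        must.append(user_query)
    return {"query": {"bool": {"must": must}}}


body = unread_docs_query(user_query={"match": {"title": "report"}})
print(json.dumps(body, indent=2))
```

One design caveat with this shape: a parent that has no child document at all will not match, so every parent would need a child carrying `read: false` from the start. If children are only created when a doc is marked as read, the restriction would instead go under `must_not` with a `has_child` clause matching `read: true`.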