
Lucene indexing strategy for documents that change often

Original post: 2011-05-16 00:21:25 3 2 java/ lucene

I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles potentially thousands of POJOs, each with its own set of key/value(s) properties. When mapping models between my application and Lucene, I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching go, but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index.

I have been thinking of changing my approach and instead creating a Document per property, assigning the same id to all the Documents that come from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think that the graph database Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure.

Could anyone comment on the possible impact on performance, querying, etc.?

2 answers

It depends fundamentally on what you want to return as a Document in a search result.

But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?
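To make the "just reindex the whole POJO" option concrete, here is a minimal sketch in plain Java. It is a toy in-memory inverted index, not Lucene, and the class name `PerPojoIndex`, the `field=value` term format, and the method names are all made up for illustration. In Lucene itself the closest equivalent is `IndexWriter.updateDocument`, which deletes the documents matching a term and adds the new one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index with one entry per POJO (plain Java, not Lucene). */
public class PerPojoIndex {
    // "field=value" -> ids of the POJOs that carry that property
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Replaces everything indexed for pojoId with the given properties. */
    public void update(String pojoId, Map<String, String> properties) {
        // Step 1: forget the old version of this POJO.
        for (Set<String> ids : postings.values()) {
            ids.remove(pojoId);
        }
        // Step 2: index every property again, changed or not.
        for (Map.Entry<String, String> p : properties.entrySet()) {
            String term = p.getKey() + "=" + p.getValue();
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(pojoId);
        }
    }

    /** Returns the ids of POJOs matching field=value, sorted, each at most once. */
    public Set<String> search(String field, String value) {
        Set<String> ids = postings.getOrDefault(field + "=" + value, Collections.emptySet());
        return new TreeSet<>(ids);
    }
}
```

The work done in step 2 grows with the number of properties on the one POJO that changed, not with the size of the whole index, which is why the cost usually stays small.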

If you only ever search a single field per request, splitting one POJO into several documents will speed up reindexing. But it causes another problem: if you search across multiple fields, the same POJO may appear many times in the results. Actually, I agree with EJP: building the index is very fast for a small dataset.
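The duplicate-hit problem can be seen in a small plain-Java model of the one-document-per-property layout. This is a sketch under stated assumptions, not Lucene code: `PerPropertyIndex`, `PropertyDoc`, and the substring match used as the "query" are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the one-document-per-property layout (plain Java, not Lucene). */
public class PerPropertyIndex {
    /** One indexed "document": a single property tagged with its owner's id. */
    static final class PropertyDoc {
        final String pojoId;
        final String field;
        final String value;

        PropertyDoc(String pojoId, String field, String value) {
            this.pojoId = pojoId;
            this.field = field;
            this.value = value;
        }
    }

    private final List<PropertyDoc> docs = new ArrayList<>();

    /** Updates one property; documents for the other properties are untouched. */
    public void update(String pojoId, String field, String value) {
        docs.removeIf(d -> d.pojoId.equals(pojoId) && d.field.equals(field));
        docs.add(new PropertyDoc(pojoId, field, value));
    }

    /** Raw hits for "any field contains text": one hit per matching property. */
    public List<String> searchAnyField(String text) {
        List<String> hits = new ArrayList<>();
        for (PropertyDoc d : docs) {
            if (d.value.contains(text)) {
                hits.add(d.pojoId);
            }
        }
        return hits;
    }

    /** Same query collapsed to distinct POJO ids, in first-seen order. */
    public Set<String> searchAnyFieldDistinct(String text) {
        return new LinkedHashSet<>(searchAnyField(text));
    }
}
```

With this layout a POJO whose title and body both contain the search text comes back twice from `searchAnyField`, so the caller has to collapse hits by POJO id, as `searchAnyFieldDistinct` does, before showing results.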