
Search application to handle up to 40 million full-text documents using Lucene

Original post: 2013-11-26 12:12:08. Tags: java, search, lucene, indexing, loading

My current search application uses Lucene for indexing. If any documents change, I believe we have to re-index from the beginning. Is this correct?

If so, all documents have to be re-indexed every time new ones are added, which is not practical for a very large collection of about 40 million full-text documents.

That is why I am asking specifically whether Lucene offers a way to index only the documents that have changed, so as to avoid a full re-index.

Any suggestions would be appreciated.

Thank you.

1 answer

You only need to reindex the changed documents; there is no need to reindex everything. IndexWriter has deleteDocuments, which can remove documents by query or term. You can then reinsert the changed document with addDocument and commit to make the change appear atomic.
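The delete, add, commit sequence described above could be sketched roughly as follows. This assumes a recent Lucene release (8 or later) on the classpath; a 2013-era Lucene 4.x would additionally need a Version argument for IndexWriterConfig. The class name, the in-memory ByteBuffersDirectory, and the field names "id" and "body" are illustrative assumptions, not part of the original answer.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class IncrementalReindex {

    // Builds a Lucene document. "id" is a hypothetical unique-key field;
    // StringField indexes it as a single un-analyzed token so it can be
    // matched exactly by a Term.
    static Document toDoc(String id, String body) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }

    // Replaces a single document: delete the old version by its id term,
    // add the new version, then commit so readers opened afterwards see
    // the change.
    static void reindex(IndexWriter writer, String id, String newBody) throws IOException {
        writer.deleteDocuments(new Term("id", id));
        writer.addDocument(toDoc(id, newBody));
        writer.commit();
    }

    // Indexes two documents, re-indexes one of them, and returns the
    // number of live documents in the index afterwards.
    public static int demo() throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(toDoc("1", "first version"));
            writer.addDocument(toDoc("2", "untouched"));
            writer.commit();

            reindex(writer, "1", "second version"); // document "2" is not touched

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs(); // counts live (non-deleted) documents
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("live documents: " + demo());
    }
}
```

IndexWriter also provides updateDocument(Term, document), which performs the delete and the add as a single atomic operation and may be the simpler choice when documents carry a unique key. With 40 million documents it is probably worth batching many changes before each commit, since commits are comparatively expensive.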

Also bear in mind that Lucene is just a library: it has no idea what kind of external entities are passed in for indexing or how and when they change. You, as the developer, are responsible for tracking that.
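One possible way to take on that responsibility is to remember a content hash per document and compare it on each run. The sketch below uses only the Java standard library; ChangeTracker and its method names are hypothetical, and the in-memory map is an assumption made for brevity (for tens of millions of documents a database table or last-modified timestamps would likely be more practical).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ChangeTracker {

    // Hash of each document's content as of the last indexing run, keyed by document id.
    private final Map<String, String> indexedHashes = new HashMap<>();

    // SHA-256 of the content, Base64-encoded so it can be stored as a string.
    static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the sorted ids of documents that are new or whose content changed
    // since the previous call, and records their current hashes.
    public List<String> changedIds(Map<String, String> currentDocs) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : currentDocs.entrySet()) {
            String hash = sha256(e.getValue());
            String previous = indexedHashes.put(e.getKey(), hash);
            if (!hash.equals(previous)) { // previous is null for a new document
                changed.add(e.getKey());
            }
        }
        Collections.sort(changed);
        return changed;
    }

    // Returns the sorted ids that were tracked before but are no longer present,
    // and stops tracking them. These would be deleted from the index.
    public List<String> removedIds(Set<String> currentIds) {
        List<String> removed = new ArrayList<>(indexedHashes.keySet());
        removed.removeAll(currentIds);
        indexedHashes.keySet().removeAll(removed);
        Collections.sort(removed);
        return removed;
    }
}
```

Only the ids returned by changedIds would then be passed to the indexing step, and the ids returned by removedIds would be deleted from the index; everything else stays as it is.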