

Writing to Lucene index, one document at a time, slows down over time

We have a program which runs continually, does various things, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity we do something like the following (a minimal sketch of this flow appears after the list):

  1. Open a db transaction, and open a Lucene IndexWriter.
  2. Make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..).
  3. If all went well, commit the db transaction and commit the IndexWriter.
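
For illustration, here is a minimal sketch of that flow (the JDBC Connection and the "id" field are hypothetical placeholders, not our actual schema):

import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class EntityUpdateSketch {
    // One db transaction plus one Lucene commit per changed entity.
    void updateEntity(Connection db, IndexWriter indexWriter, String id, Document doc)
            throws SQLException, IOException {
        db.setAutoCommit(false); // 1. open the db transaction
        try {
            // ... update the entity's rows through db here ...
            indexWriter.deleteDocuments(new Term("id", id)); // 2. replace the entity in Lucene
            indexWriter.addDocument(doc);
            db.commit(); // 3. commit both sides
            indexWriter.commit(); // <-- the call that slows down as the index grows
        } catch (SQLException | IOException e) {
            db.rollback();
            indexWriter.rollback(); // discards uncommitted changes and closes the writer
            throw e;
        }
    }
}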

This works fine, but over time the indexWriter.commit() takes more and more time. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds. I don't doubt it would take even longer if the script ran longer.

My solution so far has been to comment out the indexWriter.addDocument(..) and indexWriter.commit(), and to recreate the entire index every now and again by first using indexWriter.deleteAll() and then re-adding all documents, within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously goes against the transactional approach offered by databases and Lucene, which keeps the two in sync and keeps the updates to the database visible to users of our tools who are searching using Lucene.
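
A compact sketch of that rebuild approach (Entity and loadAllEntities() are hypothetical stand-ins for our real data access):

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;

class FullRebuildSketch {
    static class Entity { String id; String text; } // hypothetical entity

    List<Entity> loadAllEntities() { return Collections.emptyList(); } // hypothetical stub

    void rebuild(IndexWriter indexWriter) throws IOException {
        indexWriter.deleteAll(); // drop the old index contents
        for (Entity e : loadAllEntities()) { // ~250k documents
            Document doc = new Document();
            doc.add(new StringField("id", e.id, Field.Store.YES));
            doc.add(new StringField("text", e.text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }
        indexWriter.commit(); // a single commit for the whole batch (~14 seconds)
    }
}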

It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong? How can I improve the situation?

What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to those of a typical relational database, when they really don't. More specifically in your case, a commit syncs all index files with the disk, making commit times proportional to index size. That is why your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns that:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid doing commits?

Since your main goal seems to be to keep database updates visible through Lucene searches in a timely manner, you can improve the situation by doing the following:

  1. Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, instead of before it.
  2. Perform indexWriter.commit() periodically instead of on every transaction, just to make sure your changes are eventually written to disk.
  3. Use a SearcherManager for searching, and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame.

The following is an example program which demonstrates how document updates can be retrieved by periodically performing maybeRefresh(). It builds an index of 100,000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, then repeatedly searches until the update is visible. All resources are properly cleaned up on program termination. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
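            // Commit in the background every 5 minutes: this is for durability
            // (syncing index files to disk), not for making updates searchable.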
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            searcherManager = new SearcherManager(indexWriter, true, null);
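            // Refresh the searcher every 5 seconds: maybeRefresh() is what makes
            // recent updates visible, independently of how often commit() runs.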
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
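            // Busy-wait until the periodic maybeRefresh() makes the update visible;
            // the elapsed time is bounded by the 5-second refresh interval.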
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example was successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.

My first approach: do not commit that often. When you delete and re-add a document you will probably trigger a merge. Merges are somewhat slow.

If you use a near-real-time IndexReader you can still search like you used to (it does not show deleted documents), but you do not get the commit penalty. You can always commit later to make sure the file system is in sync with your index. You can do this while using your index, so you do not have to block all other operations.
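
A minimal sketch of that near-real-time pattern, assuming the same Lucene 5.x API as the example above (the SearcherManager in the other answer is essentially a thread-safe wrapper around this; real code would also need reference counting over readers, which this sketch omits):

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

class NrtReaderSketch {
    private final IndexWriter writer;
    private DirectoryReader reader;

    NrtReaderSketch(IndexWriter writer) throws IOException {
        this.writer = writer;
        // An NRT reader sees added and deleted documents without a commit.
        this.reader = DirectoryReader.open(writer, true);
    }

    IndexSearcher searcher() throws IOException {
        // Cheaply reopen only if the writer has changes since the last open.
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
            reader.close();
            reader = newReader;
        }
        return new IndexSearcher(reader);
    }
}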

See also this interesting blog post (and do read the other posts as well, they provide great information).
