
Lucene index backup

What is the best practice for backing up a Lucene index without taking it offline (a hot backup)?

You don't have to stop your IndexWriter in order to take a backup of the index.

Just use the SnapshotDeletionPolicy, which lets you "protect" a given commit point (and all files it includes) from being deleted. Then, copy the files in that commit point to your backup, and finally release the commit.

It's fine if the backup takes a while to run -- as long as you don't release the commit point with SnapshotDeletionPolicy, the IndexWriter will not delete the files (even if, e.g., they have since been merged together).

This gives you a consistent backup that is a point-in-time image of the index, without blocking ongoing indexing.
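For concreteness, here is a minimal sketch of that flow against a recent (5.x+) Lucene API; older versions, including the 3.0 API covered in the book, have slightly different signatures, and the paths and analyzer below are placeholder assumptions:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.FSDirectory;

public class HotBackup {
    public static void main(String[] args) throws Exception {
        Path indexPath = Paths.get("/var/lucene/index");   // placeholder locations
        Path backupPath = Paths.get("/var/lucene/backup");
        Files.createDirectories(backupPath);

        // Wrap the default deletion policy so commit points can be pinned.
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
            .setIndexDeletionPolicy(snapshotter);

        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath), config)) {
            writer.commit();  // snapshot() requires at least one commit to exist

            IndexCommit commit = snapshotter.snapshot();  // protect this commit point
            try {
                // Copy exactly the files belonging to the protected commit.
                for (String fileName : commit.getFileNames()) {
                    Files.copy(indexPath.resolve(fileName),
                               backupPath.resolve(fileName),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            } finally {
                snapshotter.release(commit);   // let Lucene delete these files again
                writer.deleteUnusedFiles();    // prune anything now unreferenced
            }
        }
    }
}

Indexing can continue on the same writer while the copy loop runs; the snapshot only pins the files of that one commit.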

I wrote about this in Lucene in Action (2nd edition), and there's a paper excerpted from the book, "Hot Backups with Lucene", available for free from http://www.manning.com/hatcher3 , that describes this in more detail.

This answer depends upon (a) how big your index is and (b) what OS you are using. It is suitable for large indexes hosted on Unix operating systems, and is based upon the Solr 1.3 replication strategy.

Once a file has been created, Lucene will never change it; it will only delete it. Therefore, you can use a hard-link strategy to make a backup. The approach would be:

  • stop indexing (and do a commit?), so that you can be sure you won't snapshot mid-write
  • create a hard-link copy of your index files (using cp -lr)
  • restart indexing

The cp -lr will only copy the directory structure and not the files, so even a 100GB index should copy in less than a second.
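If you would rather do this from Java than from the shell, java.nio.file can create the same hard links; a rough sketch, assuming the index lives at /var/lucene/index, that indexing is paused as described above, and that source and backup are on the same filesystem (hard links require this):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HardLinkBackup {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("/var/lucene/index");   // placeholder location
        Path backupDir = Paths.get("/var/lucene/backup-" + System.currentTimeMillis());
        Files.createDirectories(backupDir);

        // Hard-link every index file instead of copying its bytes.
        // Like cp -lr, this is nearly instant regardless of index size, and it
        // is safe because Lucene never modifies a file after writing it.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path src : files) {
                if (Files.isRegularFile(src)) {
                    Files.createLink(backupDir.resolve(src.getFileName()), src);
                }
            }
        }
    }
}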

In my opinion it would typically be enough to stop any ongoing indexing operation and simply take a file copy of your index files. Also look at the snapshooter script from Solr, which can be found in apache-solr-1.4.1/src/scripts and essentially does:

cp -lr indexLocation backupLocation

Another option might be to have a look at the Directory.copy(..) routine for a programmatic approach (e.g., using the same Directory given as a constructor parameter to the IndexWriter). You might also be interested in Snapshooter.java, which does the equivalent of the script.
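The signature of that routine has changed across Lucene releases; in the 3.x line it was a static helper, Directory.copy(src, dest, closeSrc), later removed. A sketch against that older API, with placeholder paths, and with the caveat that nothing here stops a concurrent IndexWriter from deleting files mid-copy (pause indexing or pin a commit first):

import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryCopyBackup {
    public static void main(String[] args) throws Exception {
        Directory source = FSDirectory.open(new File("/var/lucene/index"));  // placeholder paths
        Directory backup = FSDirectory.open(new File("/var/lucene/backup"));

        // Lucene 3.x static helper: copies every file from source to backup.
        // 'false' means the source directory is left open afterwards.
        Directory.copy(source, backup, false);

        backup.close();
        source.close();
    }
}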

Create a new index with a separate IndexWriter and use addIndexesNoOptimize() to merge the running index into the new one. This is very slow, but it allows you to keep the original index operational while doing the backup.
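A sketch of this approach against the Lucene 2.9/3.0-era API, where addIndexesNoOptimize() still exists (later versions renamed it addIndexes()); the paths and analyzer are placeholders:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeBackup {
    public static void main(String[] args) throws Exception {
        Directory liveDir = FSDirectory.open(new File("/var/lucene/index"));   // placeholder paths
        Directory backupDir = FSDirectory.open(new File("/var/lucene/backup"));

        IndexWriter backupWriter = new IndexWriter(backupDir,
            new StandardAnalyzer(Version.LUCENE_30),
            true,                                    // create a fresh backup index
            IndexWriter.MaxFieldLength.UNLIMITED);

        // Merge the live index's segments into the backup index.
        backupWriter.addIndexesNoOptimize(new Directory[] { liveDir });
        backupWriter.close();
    }
}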

However, you cannot write to the index while merging. So even if it is online and you can query the index, you cannot write to it during the backup.
