
How to update a Lucene.NET index?

I'm developing a Desktop Search Engine in Visual Basic 9 (VS2008) using Lucene.NET (v2.0).

I use the following code to initialize the IndexWriter:

Private writer As IndexWriter
writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), False)
writer.SetUseCompoundFile(True)

If I select the same document folder (containing files to be indexed) twice, two different entries for each file in that document folder are created in the index.

I want the IndexWriter to discard any files that are already present in the Index.

What should I do to ensure this?

As Steve mentioned, you need to use an instance of IndexReader and call its DeleteDocuments method. DeleteDocuments accepts either an instance of a Term object or Lucene's internal id of the document (it is generally not recommended to use the internal id as it can and will change as Lucene merges segments).

The best way is to use a unique identifier that you've stored in the index specific to your application. For example, in an index of patients in a doctor's office, if you had a field called "patient_id" you could create a term and pass that as an argument to DeleteDocuments. See the following example (sorry, C#):

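int patientID = 12;
IndexReader indexReader = IndexReader.Open( indexDirectory );
// Term values are strings, so convert the numeric id before building the term.
indexReader.DeleteDocuments( new Term( "patient_id", patientID.ToString() ) );
// Closing the reader saves the pending deletion to the index.
indexReader.Close();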

Then you can add the patient record again with an instance of IndexWriter. I learned a lot from this article: http://www.codeproject.com/KB/library/IntroducingLucene.aspx

Hope this helps.

There are many out-of-date examples out there on deleting with an id field. The code below will work with Lucene.NET 2.4.

If you're already using an IndexWriter, you don't need to open an IndexReader or go through IndexSearcher.Reader. You can call IndexWriter.DeleteDocuments(Term) directly. The tricky part is making sure you stored your id field correctly in the first place. Be sure to use Field.Index.NOT_ANALYZED as the index setting on your id field when storing the document. This indexes the field without tokenizing it, which is very important: none of the other Field.Index values will work when the field is used this way.

IndexWriter writer = new IndexWriter(@"\MyIndexFolder", new StandardAnalyzer());
var doc = new Document();
var idField = new Field("id", "MyItemId", Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(idField);
writer.AddDocument(doc);
writer.Commit();

Now you can easily delete or update the document using the same writer:

Term idTerm = new Term("id", "MyItemId");
writer.DeleteDocuments(idTerm);
writer.Commit();
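To see why the index setting matters, here is a minimal sketch that indexes one document and then tries to delete it by its id term. It is written against the Lucene.NET 2.9 API, which is an assumption (constructor overloads differ between versions), and the IdFieldDemo class and its method name are made up for illustration.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IdFieldDemo
{
    // Hypothetical helper (not part of Lucene.NET): indexes one document whose
    // "id" field uses the given index mode, tries to delete it by the exact
    // term, and returns how many documents are left in the index.
    public static int RemainingAfterDelete(Field.Index indexMode)
    {
        // In-memory index, used here only to keep the sketch self-contained.
        var dir = new RAMDirectory();
        // Assumes the Lucene.NET 2.9 constructors.
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

        var doc = new Document();
        doc.Add(new Field("id", "MyItemId", Field.Store.YES, indexMode));
        writer.AddDocument(doc);

        // Matches only if the index contains the exact term "MyItemId".
        writer.DeleteDocuments(new Term("id", "MyItemId"));
        writer.Close();

        IndexReader reader = IndexReader.Open(dir, true);
        int remaining = reader.NumDocs();
        reader.Close();
        return remaining;
    }
}
```

With Field.Index.NOT_ANALYZED the delete should leave 0 documents. With Field.Index.ANALYZED the StandardAnalyzer lowercases the value to "myitemid", so the term "MyItemId" matches nothing and the document should still be there.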

If you want to delete all content in the index and refill it, you could use this statement:

writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), True)

The last parameter of the IndexWriter constructor determines whether a new index is created (True), replacing any index already in that directory, or whether the existing index is opened so that new documents can be added to it (False).
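One practical wrinkle with this flag: passing False fails when no index exists yet, while passing True wipes an index that does. A minimal sketch of choosing the value at run time, assuming the Lucene.NET 2.9 API (the WriterFactory class and OpenOrCreate method are hypothetical names, not library members):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public static class WriterFactory
{
    // Hypothetical helper: opens the index in dir for appending if one exists,
    // otherwise creates a new, empty index there.
    public static IndexWriter OpenOrCreate(Lucene.Net.Store.Directory dir)
    {
        // Assumes the Lucene.NET 2.9 constructors.
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Create only when the directory does not already hold an index.
        bool create = !IndexReader.IndexExists(dir);
        return new IndexWriter(dir, analyzer, create, IndexWriter.MaxFieldLength.UNLIMITED);
    }
}
```

Calling OpenOrCreate twice on the same directory, closing the writer in between, keeps the documents added the first time instead of discarding them.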

To update a Lucene index, you need to delete the old entry and write the new one. So you need to use an IndexReader to find the current item, use the writer to delete it, and then add your new item. The same is true for multiple entries, which I think is what you are trying to do: find all the entries, delete them all, and then write the new entries.

There are several options, listed below; use whichever fits your requirements.

See the code snippet below. [The source code is C#; please convert it to VB.NET.]

Lucene.Net.Documents.Document doc = ConvertToLuceneDocument(id, data);
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo(UpdateConfiguration.IndexTextFiles));
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, false, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Index.Term idTerm = new Lucene.Net.Index.Term("id", id);

// The three scenarios below are alternatives; keep only the one you need.
foreach (FileInfo file in new DirectoryInfo(UpdateConfiguration.UpdatePath).EnumerateFiles())
{
    // Scenario 1: single-step update.
    indexWriter.UpdateDocument(idTerm, doc, analyzer);

    // Scenario 2: delete the document, then add the new version.
    indexWriter.DeleteDocuments(idTerm);
    indexWriter.AddDocument(doc);

    // Scenario 3: take the necessary steps if the document does not exist.
    Lucene.Net.Index.IndexReader iReader = Lucene.Net.Index.IndexReader.Open(indexWriter.GetDirectory(), true);
    Lucene.Net.Search.IndexSearcher iSearcher = new Lucene.Net.Search.IndexSearcher(iReader);
    int docCount = iSearcher.DocFreq(idTerm);
    iSearcher.Close();
    iReader.Close();
    if (docCount == 0)
    {
        // TODO: take the necessary steps.
        // Possible step 1: add the document.
        // indexWriter.AddDocument(doc);

        // Possible step 2: raise an error for the unknown document.
    }
}
indexWriter.Optimize();
indexWriter.Close();

Unless you're only modifying a small number of documents (say, fewer than 10% of the total), it's almost certainly faster to reindex from scratch (your mileage may vary depending on stored/indexed fields, etc.).

That said, I would always index to a temp directory, and then move the new one into place when it's done. That way, there's little downtime while the index is building, and if something goes wrong you still have a good index.
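The build-then-swap idea can be sketched with plain System.IO calls. This is only an illustration under stated assumptions: the IndexSwap class, the ".new" and ".old" suffixes, and the buildIndex callback (which stands in for whatever code writes the complete index) are all hypothetical, and liveDir is assumed to be a path without a trailing separator on the same volume as its siblings.

```csharp
using System;
using System.IO;

public static class IndexSwap
{
    // Hypothetical helper: builds a fresh index in a temporary sibling
    // directory, then renames it over the live one.
    public static void RebuildAndSwap(string liveDir, Action<string> buildIndex)
    {
        string tempDir = liveDir + ".new";  // assumed naming convention
        string oldDir = liveDir + ".old";   // assumed naming convention

        // Start from an empty temporary directory.
        if (Directory.Exists(tempDir)) Directory.Delete(tempDir, true);
        Directory.CreateDirectory(tempDir);

        // If this throws, the live index has not been touched.
        buildIndex(tempDir);

        // Move the live index aside, move the new one in, then clean up.
        if (Directory.Exists(oldDir)) Directory.Delete(oldDir, true);
        if (Directory.Exists(liveDir)) Directory.Move(liveDir, oldDir);
        Directory.Move(tempDir, liveDir);
        if (Directory.Exists(oldDir)) Directory.Delete(oldDir, true);
    }
}
```

Any searcher or reader that still has the old directory open may need to be closed before the swap and reopened afterwards; how to coordinate that is left out of this sketch.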

One option is of course to remove a document and then to add the updated version of the document.

Alternatively you can also use the UpdateDocument() method of the IndexWriter class:

writer.UpdateDocument(new Term("patient_id", document.Get("patient_id")), document);

This of course requires you to have a mechanism by which you can locate the document you want to update ("patient_id" in this example).
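Applied to the original question, the file path can serve as that locating key. The sketch below is one possible way to do it, assuming the Lucene.NET 2.9 API; the FolderIndexer class, its method names, and the "path" and "contents" field names are made up for illustration, and reading every file as text is a simplification.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class FolderIndexer
{
    // Hypothetical helper: adds or replaces one document per file,
    // keyed by the file's full path.
    public static void IndexFolder(IndexWriter writer, string folder)
    {
        foreach (string path in System.IO.Directory.GetFiles(folder))
        {
            var doc = new Document();
            // NOT_ANALYZED keeps the path as a single exact term, so it can act as a key.
            doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("contents", File.ReadAllText(path), Field.Store.NO, Field.Index.ANALYZED));

            // Deletes any document with the same path, then adds the new one.
            writer.UpdateDocument(new Term("path", path), doc);
        }
    }

    // Indexes the same folder twice into an in-memory index and returns the document count.
    public static int IndexTwiceAndCount(string folder)
    {
        var dir = new Lucene.Net.Store.RAMDirectory();
        // Assumes the Lucene.NET 2.9 constructors.
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        IndexFolder(writer, folder);
        IndexFolder(writer, folder); // second pass replaces documents instead of duplicating them
        writer.Close();

        IndexReader reader = IndexReader.Open(dir, true);
        int count = reader.NumDocs();
        reader.Close();
        return count;
    }
}
```

For a folder containing two files, IndexTwiceAndCount should return 2 rather than 4, because the second pass replaces each document by its path term.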

I have blogged more details, with a more complete source code example.


 