Incremental indexing in Lucene
I'm writing an application in Java using Lucene 3.6 and want to index incrementally. I have already created the index, and I read that what you have to do is open the existing index and compare each document's indexing date with the file's modification date; if they differ, delete the document from the index and add it again.
My problem is that I don't know how to do that with Lucene in Java.
Thanks
My code is:
public static void main(String[] args)
        throws CorruptIndexException, LockObtainFailedException,
               IOException {
    File docDir = new File("D:\\PRUEBASLUCENE");
    File indexDir = new File("C:\\PRUEBA");
    Directory fsDir = FSDirectory.open(indexDir);
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriter indexWriter
        = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
    long numChars = 0L;
    // One Lucene document per file in the directory
    for (File f : docDir.listFiles()) {
        String fileName = f.getName();
        Document d = new Document();
        d.add(new Field("Name", fileName,
                        Store.YES, Index.NOT_ANALYZED));
        d.add(new Field("Path", f.getPath(), Store.YES, Index.ANALYZED));
        long tamano = f.length();
        d.add(new Field("Size", "" + tamano, Store.YES, Index.ANALYZED));
        long fechalong = f.lastModified();
        d.add(new Field("Modification_Date", "" + fechalong, Store.YES, Index.ANALYZED));
        indexWriter.addDocument(d);
    }
    indexWriter.optimize();
    // Read the document count before closing; a closed writer cannot be queried
    int numDocs = indexWriter.numDocs();
    indexWriter.close();
    System.out.println("Index Directory=" + indexDir.getCanonicalPath());
    System.out.println("Doc Directory=" + docDir.getCanonicalPath());
    System.out.println("num docs=" + numDocs);
    System.out.println("num chars=" + numChars);
}
} // closes the enclosing class (declaration not shown)
Thanks Edmondo1984, you are helping me a lot.
Finally I wrote the code shown below, storing the hash of the file and then checking the modification date.
Indexing 9300 files takes 15 seconds, and re-indexing (with no changes to the index, because no file has changed) also takes 15 seconds. Am I doing something wrong, or can I optimize the code so that it takes less time?
Thanks jtahlborn, by doing what you suggested with the IndexReader I managed to bring the create and update times level. But shouldn't updating an existing index be faster than recreating it? Is it possible to optimize the code further?
if(IndexReader.indexExists(dir))
{
    //reader is an IndexReader and is passed as a parameter to the function
    //searcher is an IndexSearcher and is passed as a parameter to the function
    term = new Term("Hash",String.valueOf(file.hashCode()));
    Query termQuery = new TermQuery(term);
    TopDocs topDocs = searcher.search(termQuery,1);
    if(topDocs.totalHits==1)
    {
        Document doc;
        int docId,comparedate;
        docId=topDocs.scoreDocs[0].doc;
        doc=reader.document(docId);
        String dateIndString=doc.get("Modification_date");
        long dateIndLong=Long.parseLong(dateIndString);
        Date date_ind=new Date(dateIndLong);
        String dateFichString=DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE);
        long dateFichLong=Long.parseLong(dateFichString);
        Date date_fich=new Date(dateFichLong);
        //Compare the two dates
        comparedate=date_fich.compareTo(date_ind);
        if(comparedate>=0)
        {
            if(comparedate==0)
            {
                //If the comparison is 0, do nothing
                flag=2;
            }
            else
            {
                //If the comparison is >0, updateDocument
                flag=1;
            }
        }
According to the Lucene data model, you store documents inside the index. Inside each document you have the fields that you want to index, the so-called "analyzed" fields, and the fields that are not "analyzed", where you can store a timestamp and other information you might need later.
I have the feeling that you are confusing files and documents, because in your first post you speak about documents and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which only tells you whether the file is one of the files that make up a Lucene index.
If you understand the Lucene object model, writing the code you need takes approximately three minutes:
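A minimal sketch of that check-then-update decision, in plain Java and without Lucene on the classpath. The Map is only a hypothetical stand-in for looking up a document's stored timestamp by path, and the names IncrementalPlan, Action and decide are invented for illustration. In a real indexer the ADD and UPDATE cases would presumably map to IndexWriter.addDocument and IndexWriter.updateDocument.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IncrementalPlan {

    /** What to do with one file during an incremental pass. */
    public enum Action { ADD, UPDATE, SKIP }

    /**
     * Decide the action for a file, given the modification time stored in the
     * index for its path (null if the path has not been indexed yet).
     * Both timestamps are assumed to be epoch milliseconds.
     */
    public static Action decide(Long storedModified, long currentModified) {
        if (storedModified == null) {
            return Action.ADD;      // path not in the index yet
        }
        if (currentModified > storedModified.longValue()) {
            return Action.UPDATE;   // file changed since it was indexed
        }
        return Action.SKIP;         // unchanged, leave the index alone
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the index: path -> stored timestamp
        Map<String, Long> indexed = new LinkedHashMap<String, Long>();
        indexed.put("a.txt", 1000L);
        indexed.put("b.txt", 2000L);

        // Hypothetical state of the directory: path -> File.lastModified()
        Map<String, Long> onDisk = new LinkedHashMap<String, Long>();
        onDisk.put("a.txt", 1000L); // unchanged
        onDisk.put("b.txt", 2500L); // modified
        onDisk.put("c.txt", 3000L); // new

        for (Map.Entry<String, Long> e : onDisk.entrySet()) {
            Action action = decide(indexed.get(e.getKey()), e.getValue());
            System.out.println(e.getKey() + " -> " + action);
        }
    }
}
```

Whatever representation is stored in the index (raw milliseconds or a DateTools string at some resolution), the value computed from the file has to use the same one, otherwise the comparison is meaningless.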
If, on the other hand, you are sure that you always want to overwrite the previous value, you can refer to this snippet from Lucene in Action:
public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1",
                      Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands",
                      Field.Store.YES,
                      Field.Index.NO));
    doc.add(new Field("contents",
                      "Den Haag has a lot of museums",
                      Field.Store.NO,
                      Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag",
                      Field.Store.YES,
                      Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"),
                          doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}
As you can see, the snippet uses a non-analyzed ID, as I was suggesting, to store a simple, queryable attribute, and the method updateDocument to first delete and then re-add the document.
You might want to check the javadoc directly at
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)