Apache Lucene: indexing and searching on the file path

I am using Apache Lucene to index HTML files, and I store the path of each HTML file in the Lucene index. The index is written correctly; I have checked it in Luke. But when I search for the path of a file, the number of documents returned is far too high. I want the search to match the exact path as it was stored in the Lucene index. I am using the following code.

Code for index creation:


    try {
        File indexDir = new File("d:/abc/");
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(indexDir),
                new SimpleAnalyzer(),
                true,
                IndexWriter.MaxFieldLength.LIMITED);
        indexWriter.setUseCompoundFile(false);
        Document doc = new Document();
        // f is the HTML file currently being indexed
        String path = f.getCanonicalPath();
        doc.add(new Field("fpath", path,
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
        indexWriter.optimize();
        indexWriter.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }



The following is the code for searching the file path:

    File indexDir = new File("d:/abc/");
    int maxhits = 10000000;
    int len = 0;
    try {
        Directory directory = FSDirectory.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(directory, true);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
        Query query = parser.parse(path);
        query.setBoost(1.5f);
        TopDocs topDocs = searcher.search(query, maxhits);
        ScoreDoc[] hits = topDocs.scoreDocs;
        len = hits.length;
        JOptionPane.showMessageDialog(null, "items found" + len);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

It reports the number of documents found as the total number of documents in the index, even though the file with the searched path exists only once.

You are analyzing the path, which splits it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without making all terms mandatory will return all documents.
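To make the effect concrete, here is a minimal plain-Java sketch that imitates it without using Lucene at all. The class `AnalyzedPathDemo` and its methods are hypothetical names invented for this illustration; `tokenize` is only a rough stand-in for what SimpleAnalyzer does (split at non-letters, lowercase), and `searchAnyTerm` imitates a query whose terms are all optional.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class AnalyzedPathDemo {

    // Rough stand-in for SimpleAnalyzer: split at non-letters, then lowercase.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Optional-term (OR) semantics: a path matches if it shares any term with the query.
    static List<String> searchAnyTerm(List<String> indexedPaths, String queryPath) {
        Set<String> queryTerms = tokenize(queryPath);
        List<String> hits = new ArrayList<>();
        for (String path : indexedPaths) {
            Set<String> docTerms = tokenize(path);
            for (String term : queryTerms) {
                if (docTerms.contains(term)) {
                    hits.add(path);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/catalog/products/versions",
                "/catalog/products/prices",
                "/catalog/index");
        // prints [catalog, products, versions]
        System.out.println(tokenize("/catalog/products/versions"));
        // prints 3: every path shares the term "catalog" with the query
        System.out.println(searchAnyTerm(paths, "/catalog/products/versions").size());
    }
}
```

All three sample paths are reported as hits even though only one of them equals the query, which mirrors the behaviour described in the question.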

You need a search query like this (using the example above):

+catalog +products +versions

to force all terms to be present.
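As a sketch of what such a query does, the plain-Java example below builds the `+term` string from a path and imitates the "all terms required" matching. It does not call Lucene; `RequiredTermsDemo`, `toRequiredQuery`, and `searchAllTerms` are hypothetical names for this illustration, and `tokenize` only approximates SimpleAnalyzer (split at non-letters, lowercase).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class RequiredTermsDemo {

    // Rough stand-in for SimpleAnalyzer: split at non-letters, then lowercase.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Builds a query string in which every term carries the "+" (required) prefix.
    static String toRequiredQuery(String path) {
        StringBuilder query = new StringBuilder();
        for (String term : tokenize(path)) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(term);
        }
        return query.toString();
    }

    // Required-term (AND) semantics: a path matches only if it contains every query term.
    static List<String> searchAllTerms(List<String> indexedPaths, String queryPath) {
        Set<String> queryTerms = tokenize(queryPath);
        List<String> hits = new ArrayList<>();
        for (String path : indexedPaths) {
            if (tokenize(path).containsAll(queryTerms)) {
                hits.add(path);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/catalog/products/versions",
                "/catalog/products/prices",
                "/catalog/index");
        // prints +catalog +products +versions
        System.out.println(toRequiredQuery("/catalog/products/versions"));
        // prints [/catalog/products/versions]
        System.out.println(searchAllTerms(paths, "/catalog/products/versions"));
    }
}
```

With the real query parser, switching its default operator to AND (via `setDefaultOperator`) should have a similar effect to prefixing every term with `+`, though that is worth verifying against the Lucene version in use.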

Note that this gets more complicated if the same set of terms can occur in different orders, like:

/catalog/products/versions
/versions/catalog/products/SKUs

In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
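The plain-Java sketch below illustrates why term order matters, using the two paths above. It is not Lucene code: `WholePathTermDemo` and its methods are hypothetical names, `tokenize` only approximates a letter-splitting, lowercasing analyzer, and `searchWholePath` imitates a tokenizer that keeps the entire path as one unmodified term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class WholePathTermDemo {

    // Rough stand-in for a letter-splitting, lowercasing analyzer.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // All terms required, but their order in the path is ignored.
    static List<String> searchAllTerms(List<String> indexedPaths, String queryPath) {
        Set<String> queryTerms = tokenize(queryPath);
        List<String> hits = new ArrayList<>();
        for (String path : indexedPaths) {
            if (tokenize(path).containsAll(queryTerms)) {
                hits.add(path);
            }
        }
        return hits;
    }

    // The whole path is treated as a single term, so only an identical path matches.
    static List<String> searchWholePath(List<String> indexedPaths, String queryPath) {
        List<String> hits = new ArrayList<>();
        for (String path : indexedPaths) {
            if (path.equals(queryPath)) {
                hits.add(path);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/catalog/products/versions",
                "/versions/catalog/products/SKUs");
        // prints 2: both paths contain catalog, products, and versions
        System.out.println(searchAllTerms(paths, "/catalog/products/versions").size());
        // prints [/catalog/products/versions]
        System.out.println(searchWholePath(paths, "/catalog/products/versions"));
    }
}
```

In Lucene 3.x, one way to get the single-term behaviour is likely to index the field with `Field.Index.NOT_ANALYZED` (or analyze it with `KeywordAnalyzer`) and search it with a `TermQuery` on the full path. That goes slightly beyond what is stated above, so treat it as an assumption to check against your Lucene version.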
