
Apache Lucene: indexing and searching on the file path

I am using Apache Lucene to index HTML files, and I store the path of each HTML file in the Lucene index. The index is written correctly; I have checked it in Luke. But when I search for the path of a file, the number of documents returned is far too high. I want the search to match the exact path as it was stored in the index. I am using the following code.

For index creation:


    try {
        // Directory that holds the Lucene index
        File indexDir = new File("d:/abc/");
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(indexDir),
                new SimpleAnalyzer(),
                true,   // create a new index
                IndexWriter.MaxFieldLength.LIMITED);
        indexWriter.setUseCompoundFile(false);

        Document doc = new Document();
        // f is the HTML file being indexed (declared elsewhere)
        String path = f.getCanonicalPath();
        // The path is stored and analyzed
        doc.add(new Field("fpath", path,
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
        indexWriter.optimize();
        indexWriter.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }



The following is the code for searching the file path:

    File indexDir = new File("d:/abc/");
    int maxhits = 10000000;
    int len = 0;
    try {
        Directory directory = FSDirectory.open(indexDir);
        // Open the index read-only
        IndexSearcher searcher = new IndexSearcher(directory, true);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
        // path is the file path being searched for
        Query query = parser.parse(path);
        query.setBoost(1.5f);
        TopDocs topDocs = searcher.search(query, maxhits);
        ScoreDoc[] hits = topDocs.scoreDocs;
        len = hits.length;
        JOptionPane.showMessageDialog(null, "items found" + len);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

It reports the number of documents found as the total number of documents in the index, while the searched file path exists only once.

You are analyzing the path, which splits it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without making all terms mandatory will return all documents.
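To make the splitting concrete, here is a minimal plain-Java sketch that approximates letter-based, lowercasing tokenization of the kind SimpleAnalyzer performs. It does not call Lucene; the class name PathTokens and the method tokenize are made up for illustration, and the real analyzer may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation only; PathTokens and tokenize are hypothetical
// names and this is not the Lucene API.
final class PathTokens {

    // Split the text at every non-letter character and lowercase each piece.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [catalog, products, versions]
        System.out.println(tokenize("/catalog/products/versions"));
    }
}
```

Under this approximation the slashes disappear entirely, so the index holds three independent terms rather than one path.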

You need a search query like (using the example above):

+catalog +products +versions

to force all terms to be present.
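The difference between optional and mandatory terms can be sketched in plain Java. This is a simulation with hypothetical names (TermMatchDemo, matchAny, matchAll) and made-up sample paths, not Lucene code. In Lucene 3.x itself, the same effect as prefixing every term with + can usually be had by calling setDefaultOperator(QueryParser.Operator.AND) on the QueryParser, since its default operator is OR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simulation only; all names and sample paths here are hypothetical.
final class TermMatchDemo {

    // Approximate analysis: lowercase, split at non-letters, drop empties.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // OR semantics: a document matches if it contains at least one query term.
    static List<String> matchAny(List<String> paths, String query) {
        Set<String> wanted = terms(query);
        List<String> hits = new ArrayList<>();
        for (String p : paths) {
            Set<String> docTerms = terms(p);
            for (String t : wanted) {
                if (docTerms.contains(t)) {
                    hits.add(p);
                    break;
                }
            }
        }
        return hits;
    }

    // AND semantics (+term +term ...): every query term must be present.
    static List<String> matchAll(List<String> paths, String query) {
        Set<String> wanted = terms(query);
        List<String> hits = new ArrayList<>();
        for (String p : paths) {
            if (terms(p).containsAll(wanted)) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/catalog/products/versions",
                "/catalog/products/prices",
                "/catalog/orders/history");
        String query = "/catalog/products/versions";
        System.out.println(matchAny(paths, query).size()); // 3: all share "catalog"
        System.out.println(matchAll(paths, query).size()); // 1: only the first has all terms
    }
}
```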

Note that this gets more complicated if the same set of terms can occur in different orders, like:

/catalog/products/versions
/versions/catalog/products/SKUs

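In that case, matching on individual terms cannot tell the two paths apart, so you need a different Lucene tokenizer than the one your analyzer uses (SimpleAnalyzer in the code above), for example one that keeps the whole path as a single term.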
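As a rough illustration of why term order matters, the following plain-Java sketch compares all-terms matching with an exact lookup on the untokenized path. It is a simulation with hypothetical names (ExactPathDemo, matchAllTerms, buildExactIndex), not Lucene code. In Lucene 3.x, the exact-lookup idea would typically correspond to indexing the field with Field.Index.NOT_ANALYZED (or using KeywordAnalyzer) and searching with a TermQuery on the full path; treat that mapping as a suggestion rather than the only option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Simulation only; all names here are hypothetical.
final class ExactPathDemo {

    // Approximate analysis: lowercase, split at non-letters, drop empties.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // All query terms must be present, but their order is ignored.
    static List<String> matchAllTerms(List<String> paths, String query) {
        Set<String> wanted = terms(query);
        List<String> hits = new ArrayList<>();
        for (String p : paths) {
            if (terms(p).containsAll(wanted)) {
                hits.add(p);
            }
        }
        return hits;
    }

    // Exact index: the whole path is one key, mapped to its document id.
    static Map<String, Integer> buildExactIndex(List<String> paths) {
        Map<String, Integer> index = new HashMap<>();
        for (int docId = 0; docId < paths.size(); docId++) {
            index.put(paths.get(docId), docId);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/catalog/products/versions",
                "/versions/catalog/products/SKUs");
        String query = "/catalog/products/versions";
        // Both paths contain catalog, products and versions.
        System.out.println(matchAllTerms(paths, query).size()); // 2
        // Only the identical path is found by the exact lookup.
        System.out.println(buildExactIndex(paths).get(query));  // 0
    }
}
```

The trade-off is that an untokenized path can only be found by its complete, exact value, so searches on a single directory name no longer match.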
