How do I index and search text files in Lucene 3.0.2?

Question

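I am new to Lucene, and I am having trouble writing simple code to query a collection of text files.

I tried this example, but it is incompatible with the new version of Lucene.

UPDATE: This is my new code, but it still doesn't work.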

Answer 1

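Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)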
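The article linked above is not reproduced here, so the following is a minimal sketch of reading a whole text file into a String with the Java standard library only. The names `TextFiles`, `readAll`, and `readFile` are made up for illustration, and UTF-8 is an assumed encoding; adjust it to match your files.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class TextFiles {

    // Reads everything from the reader into one String and closes the reader.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reads a file as UTF-8 text (the encoding is an assumption).
    static String readFile(File file) {
        try {
            return readAll(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
        } catch (IOException e) { // e.g. the file does not exist
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            System.out.println(readFile(new File(args[0])));
        }
    }
}
```

The returned String is the kind of value that the document-creation snippet further down calls contentsOfYourFile.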

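Whatever you do in Lucene, indexing or searching, you need an analyzer. The goal of an analyzer is to tokenize the input text (break it into words) and, depending on the analyzer, stem it (reduce each word to its base form). Analyzers typically also throw out the most frequent words such as "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create a SnowballAnalyzer instance for English, do the following: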

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
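To make the three analysis steps concrete, here is a toy sketch in plain Java. It is not how SnowballAnalyzer works internally: the class name `ToyAnalyzer`, the tiny stop-word list, and the suffix-stripping rules are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: this is NOT the Snowball stemming algorithm.
final class ToyAnalyzer {

    // A tiny, assumed stop-word list; real analyzers use much larger ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and", "is", "are"));

    // Very naive "stemmer": strips a trailing "ing" or a plural "s".
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercases and tokenizes the text, drops stop words, and stems what is left.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [index, text, file]
        System.out.println(analyze("The indexing of text files"));
    }
}
```

Because the index ends up holding reduced terms such as "file" rather than "files", query text has to go through the same reduction before it can match.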

If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.

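You need to store your index somewhere. There are two main possibilities: an in-memory index (easy to try out) and a disk index (the most common one).
Use either of the next two lines: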

Directory directory = new RAMDirectory();   // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage

When you want to add, update, or delete documents, you need an IndexWriter:

IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

Any document (a text file in your case) is a set of fields. To create a document that holds information about your file, use this:

Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc);  // writing new document to the index

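The Field constructor takes the field's name, its text, and at least 2 more parameters. The first is a flag that tells Lucene whether it must store this field. If it equals Field.Store.YES, you will be able to get all of the text back from the index; otherwise only index information about it is stored.
The second parameter tells Lucene whether it must index this field. Use Field.Index.ANALYZED for any field you are going to search on.
Normally you use both parameters as shown above.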
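The difference between storing and indexing can be pictured with a small conceptual model in plain Java. This is only a sketch under simplifying assumptions: `ToyFieldIndex` and its methods are invented for illustration, and Lucene's real index structures are far more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Conceptual model only; not Lucene's actual data structures.
final class ToyFieldIndex {
    // "Indexed" side: term -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();
    // "Stored" side: document id -> original text, kept only on request.
    private final Map<Integer, String> storedText = new HashMap<Integer, String>();

    // Always makes the text searchable; keeps the original only if store is true.
    void add(int docId, String text, boolean store) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                postings.put(term, docs);
            }
            if (!docs.contains(docId)) {
                docs.add(docId);
            }
        }
        if (store) {
            storedText.put(docId, text);
        }
    }

    // Returns the ids of documents containing the term (empty list if none).
    List<Integer> search(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }

    // Returns the original text, or null if the document was added without storing.
    String stored(int docId) {
        return storedText.get(docId);
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        index.add(0, "Lucene in Action", true);
        index.add(1, "Lucene for Dummies", false);
        System.out.println(index.search("lucene")); // both documents are found
        System.out.println(index.stored(1));        // but the text of document 1 cannot be retrieved
    }
}
```

In this model a document added without storing is still found by a search, but its text cannot be printed afterwards, which mirrors the trade-off described above.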

Don't forget to close your IndexWriter after the job is done:

writer.close();

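Searching is a bit trickier. You will need several classes: Query and QueryParser to make a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store the results (it is passed to IndexSearcher as a parameter), and ScoreDoc to iterate through the results. The next snippet shows how this all fits together: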

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
// hits[i].doc is Lucene's internal document number. Note that this number may change after documents are deleted.
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}

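Note the second argument to the QueryParser constructor: it is the default field, i.e. the field that is searched when no qualifier is given. For example, if your query is "title:term", Lucene searches for the word "term" in the field "title" of all documents, but if your query is just "term", it searches the default field, in this case "content". For more information, see the Lucene Query Syntax.
QueryParser also takes the analyzer as its last argument. This must be the same analyzer you used to index your text.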
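The default-field rule can be sketched in a few lines of plain Java. This is only an illustration of the rule for a single query clause; `ToyFieldResolver` is a made-up name, and the real QueryParser handles a much richer syntax (phrases, boolean operators, escaping, and so on).

```java
import java.util.Arrays;

// Illustrates the default-field rule only; not how QueryParser is implemented.
final class ToyFieldResolver {

    // Returns {field, term}: the explicit field for "field:term", otherwise the default field.
    static String[] resolve(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        if (colon > 0 && colon < clause.length() - 1) {
            return new String[] { clause.substring(0, colon), clause.substring(colon + 1) };
        }
        return new String[] { defaultField, clause };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve("title:term", "content"))); // [title, term]
        System.out.println(Arrays.toString(resolve("term", "content")));       // [content, term]
    }
}
```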

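The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that says how many results you want to collect. For example, if it equals 100, Lucene collects only the top 100 results (by score) and drops the rest. This is purely an optimization: you collect the best few results, and if you are not satisfied with them, you repeat the search with a larger number.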
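The idea of keeping only the best N results can be sketched with a bounded min-heap in plain Java. This is a conceptual illustration under simplifying assumptions, not TopScoreDocCollector's actual implementation: `ToyTopN` is an invented class, and it tracks bare scores rather than document numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Conceptual sketch; not TopScoreDocCollector's actual implementation.
final class ToyTopN {
    private final int limit;
    // Min-heap: the lowest score that is currently kept sits on top.
    private final PriorityQueue<Float> heap = new PriorityQueue<Float>();

    ToyTopN(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.limit = limit;
    }

    // Keeps the score only if it belongs to the best `limit` scores seen so far.
    void collect(float score) {
        if (heap.size() < limit) {
            heap.add(score);
        } else if (score > heap.peek()) {
            heap.poll();      // drop the weakest kept score
            heap.add(score);
        }
    }

    // Returns the kept scores, best first.
    List<Float> topScores() {
        List<Float> result = new ArrayList<Float>(heap);
        Collections.sort(result, Collections.<Float>reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        ToyTopN top = new ToyTopN(2);
        for (float s : new float[] {0.3f, 0.9f, 0.1f, 0.7f}) {
            top.collect(s);
        }
        System.out.println(top.topScores()); // [0.9, 0.7]
    }
}
```

Memory stays proportional to the limit rather than to the number of matching documents, which is why asking for a small number first and repeating the search with a larger one is a reasonable strategy.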

Finally, don't forget to close the searcher and the directory so that you don't leak system resources:

searcher.close();
directory.close();

EDIT: Also see the IndexFiles demo class in the Lucene 3.0 sources.

Answer 2

package org.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;


import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LuceneSimple {

 private static void addDoc(IndexWriter w, String value) throws IOException {
  Document doc = new Document();
  doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
  w.addDocument(doc);
 }



 public static void main(String[] args) throws CorruptIndexException, LockObtainFailedException, IOException, ParseException {

     File dir = new File("F:/tmp/dir");

  StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

  Directory index = new RAMDirectory();
  //Directory index = FSDirectory.open(new File("lucDirHello") );


  IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

  w.setRAMBufferSizeMB(200);

  System.out.println(index.getClass() + " RamBuff:" + w.getRAMBufferSizeMB() );

  addDoc(w, "Lucene in Action");
     addDoc(w, "Lucene for Dummies");
     addDoc(w, "Managing Gigabytes");
     addDoc(w, "The Art of Computer Science");
     addDoc(w, "Computer Science ! what is that ?");


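     long indexedLines = 0L;  // number of lines added from the files in dir

     File[] files = dir.listFiles();  // null if dir does not exist or is not a directory
     if (files != null) {
      for (File f : files) {
       if (!f.isFile()) continue;  // skip subdirectories
       BufferedReader br = new BufferedReader(new FileReader(f));
       try {
        String line;
        while ((line = br.readLine()) != null) {
         if (line.length() < 140) continue;  // skip lines shorter than 140 characters
         addDoc(w, line);
         ++indexedLines;
        }
       } finally {
        br.close();  // close the reader even if indexing fails
       }
      }
     }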

     w.close();

     // 2. query
     String querystr = "Computer";

     Query q = new QueryParser( Version.LUCENE_30, "title", analyzer ).parse(querystr);


     //search
     int hitsPerPage = 10;

     IndexSearcher searcher = new IndexSearcher(index, true);

     TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);

     searcher.search(q, collector);

     ScoreDoc[] hits = collector.topDocs().scoreDocs;

     System.out.println("Found " + hits.length + " hits.");
     for(int i=0;i<hits.length;++i) {
       int docId = hits[i].doc;
       Document d = searcher.doc(docId);
       System.out.println((i + 1) + ". " + d.get("title"));
     }


     searcher.close();

 }

}

Answer 3

I'd suggest you look at Solr @ http://lucene.apache.org/solr/ rather than working with the Lucene API directly.

3 answers

Answer 1: 34, accepted, 2010-11-03 22:19:42
Answer 2: 3, 2010-11-03 22:07:32
Answer 3: 1, 2010-11-03 20:41:13
