How to index a PDF file with Lucene

I have to build a full-text search with Lucene in my project, so I need to index a BLOB column in a MySQL database (it contains PDF, DOC, XSL, XML and image files). With DOC, XSL and XML I don't have any problems, but with PDF files I can't get any results.

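    import java.io.File;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class Indexfile {
        public static void main(String[] args) throws Exception {
            // RemoteControlServiceConnection: project-specific JDBC helper (not shown)
            RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                    "jdbc:mysql://localhost:3306/Test", "root", "root");
            Connection conn = a.getConnexionMySQL();
            final File INDEX_DIR = new File("index");
            // true = create a new index, overwriting any existing one
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

            String query = "SELECT id, name, document FROM Table_document";
            Statement statement = conn.createStatement();
            ResultSet result = statement.executeQuery(query);

            while (result.next()) {
                Document document = new Document();
                document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
                document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
                // the BLOB is read as a plain string here, whatever the file type
                document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(document);
            }

            writer.close();
        }
    }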

For searching I use:

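    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocCollector;

    public class searchlucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            String qu = "montbel*"; // put your keyword here
            // String IndexStoreDir = "index-directory";
            try {
                // search the "document" field, which holds the file content
                Query q = new QueryParser("document", analyzer).parse(qu);
                int hitspp = 100; // hits per page
                IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
                TopDocCollector collector = new TopDocCollector(hitspp);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println("Found " + hits.length + " hits.");
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    System.out.println((i + 1) + ". " + d.get("name"));
                }
                searcher.close();
            } catch (Exception ex1) {
                ex1.printStackTrace(); // report failures instead of silently swallowing them
            }
        }
    }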

To parse any kind of file, use the Apache Tika project to extract the text, then index that text with Lucene. Tika already bundles many parser libraries (PDFBox and others).
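Since the BLOB column mixes several file types, the indexer has to know which rows hold PDFs before handing them to a text extractor. Tika does this detection for you; the sketch below is only a hand-rolled JDK illustration of the idea, not Tika's API. It relies on the fact that PDF files start with the bytes `%PDF-`. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decides whether a BLOB's bytes look like a PDF file.
final class BlobTypeSniffer {

    // Every PDF file begins with this ASCII signature.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private BlobTypeSniffer() {
    }

    // Returns true when the blob starts with the PDF signature.
    static boolean looksLikePdf(byte[] blob) {
        if (blob == null || blob.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (blob[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("pdf blob: " + looksLikePdf(pdf));
        System.out.println("xml blob: " + looksLikePdf(xml));
    }
}
```

In the indexing loop you would read the column as bytes (for example with `result.getBytes("document")`) and only route blobs that pass this check to the PDF text extraction step.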

First, you need to convert the PDF file's content to text, then add that text to the index.
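To see why indexing the raw bytes finds nothing: the text inside a PDF is usually stored in compressed streams (commonly with the FlateDecode filter, which is zlib/deflate). The sketch below is not a PDF parser; it only uses the JDK's `Deflater` and `Inflater` to show that a word is no longer findable in compressed bytes and reappears once they are decoded. The class name, method names and sample sentence are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: compressed text cannot be searched as plain text.
final class CompressedTextDemo {

    private CompressedTextDemo() {
    }

    // Compresses bytes with zlib/deflate (the algorithm behind FlateDecode).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses zlib/deflate data back to the original bytes.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Naive search: treats every byte as one character and looks for the word.
    static boolean containsWord(byte[] raw, String word) {
        return new String(raw, StandardCharsets.ISO_8859_1).contains(word);
    }

    public static void main(String[] args) {
        byte[] plain = "rapport annuel de montbel, commune de montbel"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(plain);
        System.out.println("found in compressed bytes: " + containsWord(packed, "montbel"));
        System.out.println("found after decoding:      " + containsWord(inflate(packed), "montbel"));
    }
}
```

A real PDF adds more layers on top of compression (font encodings, drawing operators), which is why a dedicated extractor is needed rather than a plain decompression step.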

For example, you can use PDFBox to convert the PDF content to text:

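    // needs org.apache.pdfbox.pdmodel.PDDocument and PDFTextStripper
    // (package org.apache.pdfbox.util in PDFBox 1.x, org.apache.pdfbox.text in 2.x)
    String contents = "";
    PDDocument doc = null;
    try {
        doc = PDDocument.load(file); // file: the PDF to index (a java.io.File)
        PDFTextStripper stripper = new PDFTextStripper();

        stripper.setLineSeparator("\n");
        stripper.setStartPage(1);
        stripper.setEndPage(5); // this means only the first 5 pages will be indexed
        contents = stripper.getText(doc);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close(); // release the resources held by the loaded PDF
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }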

Then add the content to the Lucene document, for example:

    luceneDoc.add(new Field(CONTENT_FIELD, contents, Field.Store.NO, Field.Index.TOKENIZED));

First, you can read your PDF through iText, like this:

    String content = "";
    try {
        PdfReader readerObj = new PdfReader("file path");
        int n = readerObj.getNumberOfPages(); // total page count, useful for looping over pages 1..n
        content = PdfTextExtractor.getTextFromPage(readerObj, 2); // extract the content of one particular page (here page 2)
        readerObj.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Then add your PDF content to the Lucene document:

    doc.add(new Field("pdfContent", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
