How to index a PDF file with Lucene

Question

I have to build a full-text search with Lucene in my project, so I need to index a BLOB column in a MySQL database. The column contains PDF, DOC, XSL, XML and image files. With DOC, XSL and XML I don't have any problems, but with PDF files I can't get any results.

    import java.io.File;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class Indexfile {
        public static void main(String[] args) throws Exception {
            RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                    "jdbc:mysql://localhost:3306/Test", "root", "root");
            Connection conn = a.getConnexionMySQL();
            final File INDEX_DIR = new File("index");
            IndexWriter writer = new IndexWriter(INDEX_DIR,
                    new StandardAnalyzer(),
                    true); // true = create a new index, replacing any existing one

            String query = "SELECT id, name, document FROM Table_document";
            Statement statement = conn.createStatement();
            ResultSet result = statement.executeQuery(query);

            while (result.next()) {
                Document document = new Document();
                document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
                document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
                document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(document);
            }

            writer.close();
        }
    }

For searching I use:

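    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocCollector;

    public class searchlucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            String qu = "montbel*"; // put your keyword here
            // String IndexStoreDir = "index-directory";
            try {
                Query q = new QueryParser("document", analyzer).parse(qu);
                int hitspp = 100; // hits per page
                IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
                TopDocCollector collector = new TopDocCollector(hitspp);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println("Found " + hits.length + " hits.");
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    System.out.println((i + 1) + ". " + d.get("name"));
                }
                searcher.close();
            } catch (Exception ex1) {
                ex1.printStackTrace(); // report failures instead of silently swallowing them
            }
        }
    }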

Answer 1

To parse any kind of file, use the Apache Tika project, then index the extracted text with Lucene. Tika already bundles many parsing libraries (PDFBox, ...).
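To illustrate the detection step that a toolkit like Tika performs before it can choose a parser, here is a minimal JDK-only sketch that checks whether a blob looks like a PDF by its leading `%PDF-` signature. The names `PdfSniffer` and `looksLikePdf` are made up for this example and are not part of Tika or Lucene; Tika's own detection covers far more formats and edge cases.

```java
import java.nio.charset.StandardCharsets;

class PdfSniffer {
    // PDF files normally begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the blob starts with the PDF signature (hypothetical helper). */
    static boolean looksLikePdf(byte[] blob) {
        if (blob == null || blob.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (blob[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a check like this, rows whose blob is a PDF could be routed to a PDF text extractor, while XML rows could still be indexed as plain text.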

Answer 2

First you need to convert the PDF file's content to text, then add that text to the index.

For example, you can use PDFBox to convert the PDF content to text:

    String contents = "";
    PDDocument doc = null;
    try {
        doc = PDDocument.load(file);
        PDFTextStripper stripper = new PDFTextStripper();

        stripper.setLineSeparator("\n");
        stripper.setStartPage(1);
        stripper.setEndPage(5); // extracts (and so indexes) only the first 5 pages
        contents = stripper.getText(doc);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close(); // release the resources held by the PDF
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Then add the contents to the Lucene document, for example:

    luceneDoc.add(new Field(CONTENT_FIELD, contents, Field.Store.NO, Field.Index.TOKENIZED));
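Since the question stores the files in a BLOB column and not on disk, the bytes have to be read in binary form before any PDF extractor can work on them. The sketch below is a JDK-only helper that drains an `InputStream`, such as the one returned by `result.getBinaryStream("document")`, into a byte array. `BlobBytes` and `readAll` are hypothetical names introduced for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BlobBytes {
    /** Reads the whole stream into memory and closes it (hypothetical helper). */
    static byte[] readAll(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            // read() returns -1 once the end of the stream is reached
            while ((read = stream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Assuming a PDFBox 1.x or 2.x release, the resulting bytes can be wrapped in a `ByteArrayInputStream` and passed to `PDDocument.load(InputStream)` in place of the `file` argument used above.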

Answer 3

First you can read your PDF with iText, like this:

    String content = "";
    try {
        PdfReader readerObj = new PdfReader("file path");
        int n = readerObj.getNumberOfPages(); // total number of pages
        content = PdfTextExtractor.getTextFromPage(readerObj, 2); // extract the content of one particular page (here page 2)
        readerObj.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Then add your PDF content to the Lucene document:

    doc.add(new Field("pdfContent", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
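The snippet above extracts a single page. To index the whole file, the per-page text can be collected in a loop over pages 1 to n (iText page numbers start at 1). The following is a minimal JDK-only sketch of that loop; `PageTextJoiner` and its `PageText` callback are hypothetical names that stand in for a call such as `PdfTextExtractor.getTextFromPage(readerObj, page)`.

```java
class PageTextJoiner {
    /** Hypothetical callback: returns the text of one page (1-based). */
    interface PageText {
        String get(int page) throws Exception;
    }

    /** Concatenates the text of pages 1..pageCount, ending each page with a newline. */
    static String joinPages(int pageCount, PageText source) {
        StringBuilder all = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            try {
                all.append(source.get(page)).append('\n');
            } catch (Exception e) {
                // Wrap so callers are not forced to handle checked exceptions
                throw new IllegalStateException("Could not read page " + page, e);
            }
        }
        return all.toString();
    }
}
```

With iText this might be called as `PageTextJoiner.joinPages(n, p -> PdfTextExtractor.getTextFromPage(readerObj, p))`, and the returned string passed as `content` to the `Field` above.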

3 answers

solution1
6 2014-05-21 15:06:32

solution2
1 ACCEPTED 2014-05-20 14:16:21

solution3
0 2014-05-21 07:17:34