
Optimize PDF Word Search

I have an application that iterates over a directory of PDF files and searches for a string. I am using PDFBox to extract the text from the PDFs, and the code is pretty straightforward. At first, searching through 13 files took a minute and a half to load the results, but I noticed that PDFBox was putting a lot of stuff in the log file. I changed the logging level and that helped a lot, but it is still taking over 30 seconds to load a page. Does anybody have any suggestions on how I can optimize the code, or another way to determine how many hits are in a document? I played around with Lucene, but it seems to only give you the number of hits in a directory, not the number of hits in a particular file.

Here is my code to get the text out of a PDF:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public static String parsePDF(String filename) throws IOException
{
    FileInputStream fi = new FileInputStream(new File(filename));
    try {
        PDFParser parser = new PDFParser(fi);
        parser.parse();
        COSDocument cd = parser.getDocument();
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(new PDDocument(cd));
        } finally {
            cd.close();  // release the parsed document's resources
        }
    } finally {
        fi.close();  // the input stream was never closed in the original
    }
}
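Once the text is extracted, counting the hits inside that single file is a simple scan over the string. A minimal sketch (`HitCounter` and `countHits` are hypothetical names for illustration, not part of PDFBox):

```java
// Counts non-overlapping occurrences of `term` in `text`.
// HitCounter/countHits are illustrative names, not PDFBox API.
public class HitCounter {
    public static int countHits(String text, String term) {
        if (term == null || term.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        // indexOf returns -1 once there are no more matches
        while ((from = text.indexOf(term, from)) != -1) {
            count++;
            from += term.length();  // advance past this match
        }
        return count;
    }
}
```

Combined with `parsePDF`, this gives a per-file hit count without any extra libraries, though it does not avoid the cost of re-extracting text on every search the way an index would.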

Lucene would allow you to index each document separately.
Instead of using PDFBox directly, you can use Apache Tika to extract the text and feed it to Lucene. Tika uses PDFBox internally, but it provides an easy-to-use API as well as the ability to extract content from many document types seamlessly.
Once you have a Lucene document for each file in your directory, you can search against the complete index.
Lucene matches the search term and returns the number of results (files) whose content matches.
It is also possible to get the number of hits within each Lucene document/file using the Lucene API. This is called the term frequency, and it can be calculated for the document and field being searched.
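To make the term-frequency idea concrete without pulling in Lucene, here is a library-free sketch of a per-file term-frequency map; all class and method names are illustrative, not Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of per-document term frequency, the statistic
// Lucene exposes through its API. Names here are illustrative only.
public class TinyIndex {
    // filename -> (term -> frequency)
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    public void addDocument(String filename, String text) {
        Map<String, Integer> freqs = new HashMap<>();
        // crude tokenization: lowercase, split on non-word characters
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        index.put(filename, freqs);
    }

    // Term frequency of `term` in one file, 0 if the file or term is absent.
    public int termFreq(String filename, String term) {
        Map<String, Integer> freqs = index.get(filename);
        if (freqs == null) {
            return 0;
        }
        return freqs.getOrDefault(term.toLowerCase(), 0);
    }
}
```

A real Lucene index adds tokenization rules, scoring, and on-disk storage on top of this, but the per-document count it reports is the same idea.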

Example from "In a Lucene / Lucene.net search, how do I count the number of hits per document?":

List<Integer> docIds = ...; // doc ids for documents that matched the query,
                            // sorted in ascending order

int totalFreq = 0;
TermDocs termDocs = reader.termDocs();
termDocs.seek(new Term("my_field", "congress"));
for (int id : docIds) {
    termDocs.skipTo(id);           // advance to this matching document
    totalFreq += termDocs.freq();  // add its term frequency
}
