
What is the best approach to implement search for searching documents (PDF, XML, HTML, MS Word)?

What could be a good way to code search functionality for searching documents in a Java web application?

Is 'tagged search' a good fit for this kind of search functionality?

Why re-invent the wheel?

Check out Apache Lucene.

Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: How do I implement Search Functionality in a website?

You could use Solr, which sits on top of Lucene and is a full web search engine application, whereas Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract their text and metadata. You have to index documents based on a pre-defined document schema.
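To see what "indexing" means here, the core data structure Lucene maintains is an inverted index: a map from each term to the documents containing it. The toy class below is a plain-Java illustration of that idea only; the names are made up and none of this is Lucene's actual API.

```java
import java.util.*;

// Toy inverted index: term -> sorted set of document IDs containing it.
// Illustrative only; real engines add scoring, stemming, stored fields, etc.
public class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercase, split on non-alphanumerics, and record term -> docId.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: IDs of documents containing every query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "Java web application search");
        idx.add(2, "Searching PDF and Word documents");
        idx.add(3, "Full text search in Java");
        System.out.println(idx.search("java search")); // matches docs 1 and 3
    }
}
```

Note that "search" does not match "Searching" in document 2: this sketch has no stemming, which is one of many analysis steps Lucene's analyzers handle for you.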

As for extracting the text content of Office documents (which you will need to do before feeding them to Lucene), there is the Apache Tika project, which supports many file formats, including Microsoft's.

Just an update:

There is another alternative to Solr called "ElasticSearch". It's a capable project with features similar to Solr's, but schemaless.

Both projects are built on top of Lucene.

Using Tika, the code to get the text from a file is quite simple:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// exception handling not shown
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();

So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back whatever makes the most sense for that format. I can get text for indexing out of everything I've thrown at it so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly lie in extracting formatted text rather than raw plaintext.
