
Open Source Java Text Parsers

2011-06-22 17:48:11 | Tags: java, pdf, ms-office, openoffice.org, text-parsing

Is there a single Java text parser that can parse Microsoft Office (Windows) documents, OpenOffice documents, and PDFs? Otherwise, do I need to use something like Apache POI for Word documents and other libraries for OpenOffice and PDFs? If so, what are the best options for OpenOffice and PDFs?

2 Answers

If the task is reading PDF documents, iText is your best bet. For Microsoft Office and OpenOffice (LibreOffice) based documents, POI would be my solution.

Apache Tika:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

I'm not sure whether this qualifies as a "single" parser for your purposes, since Tika delegates to existing parser libraries under the hood.
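To make the "detects" part of that description more concrete, here is a minimal sketch of how a front end might guess a document's family from its first bytes before handing it to a format-specific parser. This is not Tika's implementation, and the `FormatSniffer` class and `Family` enum are hypothetical names made up for illustration. It assumes Java 11 or later and uses only the standard library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): guesses a document family from its leading bytes. */
final class FormatSniffer {

    public enum Family { PDF, OLE2_OFFICE, ZIP_CONTAINER, UNKNOWN }

    // A well-formed PDF file starts with "%PDF-".
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    // OLE2 compound file signature, used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // ZIP local file header; both OOXML (.docx) and ODF (.odt) files are ZIP packages.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    private FormatSniffer() {}

    /** Classifies a buffer holding the first bytes of a file. */
    public static Family sniff(byte[] head) {
        if (startsWith(head, PDF)) return Family.PDF;
        if (startsWith(head, OLE2)) return Family.OLE2_OFFICE;
        if (startsWith(head, ZIP)) return Family.ZIP_CONTAINER;
        return Family.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    public static Family sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    // True when data begins with the given signature bytes.
    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(data, 0, magic.length, magic, 0, magic.length);
    }
}
```

A caller could then route `PDF` to a PDF library and `OLE2_OFFICE` to POI. A `ZIP_CONTAINER` result needs a second look inside the archive, because an OOXML package carries a `[Content_Types].xml` entry while an ODF package carries a `mimetype` entry. A toolkit like Tika exists largely to hide that kind of routing behind one entry point.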