How can I extract raw text from PDFs using Apache POI?

I need to extract raw text from several files, some of which are PDFs and some of which are DOC files.

I have to use Apache POI to do this. I have found plenty of documentation on working with Word files (extracting text, writing to them, etc.), but I cannot find any documentation on extracting text from a PDF.

Am I wrong in believing that Apache POI has this capability?

If so, can anyone recommend similar Java libraries that can extract text from multiple file formats?

If not, can anyone point me to the documentation and/or the classes and methods I should be looking at?

Thank you in advance for any help.

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, and PDF is not one of them.

You'll either want to use Apache PDFBox directly, or use Apache Tika, which handles both Microsoft Office and PDF file formats (amongst many others).
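
For the Tika route, a minimal sketch might look like the following. It assumes Apache Tika (tika-core plus the parser modules that cover PDF and Office formats) is on the classpath; the class name TextExtractor and the method name extract are made up for illustration. Tika detects the file type itself, so the same call should serve both the PDF and the DOC files.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class; the name is an assumption, not part of Tika.
public class TextExtractor {

    // The Tika facade detects the document type and selects a parser.
    private static final Tika TIKA = new Tika();

    // Returns the plain text of the given file (PDF, DOC, and other supported formats).
    public static String extract(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        // Each command-line argument is treated as a path to a document.
        for (String path : args) {
            System.out.println("== " + path + " ==");
            System.out.println(extract(new File(path)));
        }
    }
}
```

Note that parseToString truncates its output at a default maximum length; for large documents, calling setMaxStringLength on the Tika instance (with -1 to lift the limit) may be needed.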
