
How can I extract raw text from PDFs using Apache POI?

Asked 2013-06-04 05:55:50. Tags: java, pdf, apache-poi, text-extraction

I need to extract raw text from several files, some of which are PDFs and some of which are DOC files.

I have to use Apache POI to do this. I have found plenty of documentation on dealing with Word files (extracting text, writing to them, etc.), but I am unable to find any documentation on extracting text from a PDF.

Am I wrong in believing that Apache POI has this capability?

If so, can anyone recommend similar Java libraries that allow text extraction from multiple file formats?

If not, can anyone point me to the documentation and/or the classes/methods that I should be looking at to do this?

Thank you in advance for any help.

1 Answer

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, and PDF isn't one of them.

You'll either want to use Apache PDFBox directly, or use Apache Tika, which handles both Microsoft Office and PDF file formats (amongst many others).
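To make the split between the libraries concrete, here is a minimal sketch of choosing a library by file extension. `ExtractorChooser` and its `Library` enum are hypothetical names invented for this illustration; they are not part of POI, PDFBox, or Tika. The sketch only decides which library to call, and it assumes the file extension is a good enough hint of the real format. The actual extraction calls are named in the comments and would need the corresponding library on the classpath.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a POI, PDFBox, or Tika class): picks a library by file extension. */
final class ExtractorChooser {

    /** The three libraries mentioned in the answer. */
    enum Library { APACHE_POI, APACHE_PDFBOX, APACHE_TIKA }

    // Assumed mapping for this sketch:
    //   .doc  -> POI, e.g. new WordExtractor(inputStream).getText()
    //   .docx -> POI, e.g. new XWPFWordExtractor(new XWPFDocument(inputStream)).getText()
    //   .pdf  -> PDFBox, e.g. new PDFTextStripper().getText(pdDocument)
    private static final Map<String, Library> BY_EXTENSION = Map.of(
            "doc", Library.APACHE_POI,
            "docx", Library.APACHE_POI,
            "pdf", Library.APACHE_PDFBOX);

    private ExtractorChooser() {
        // static helper, no instances
    }

    /**
     * Returns the library this sketch would use for the given file name.
     * Unknown or missing extensions fall back to Tika, whose facade call
     * new Tika().parseToString(file) detects the format on its own.
     */
    static Library choose(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, Library.APACHE_TIKA);
    }
}
```

With this mapping, `ExtractorChooser.choose("report.PDF")` yields `APACHE_PDFBOX` and `ExtractorChooser.choose("letter.doc")` yields `APACHE_POI`. If you would rather not maintain such a mapping at all, using Tika for every file is the simpler route, since it covers both format families.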

Related questions

How can I extract right-to-left text from .doc and .docx files using Apache POI in java?

How to extract text from .doc document using apache poi?

How can I Filter watermark text from XML using XPATH or Apache POI?

how to extract text from ppt, pptx file except footer, slide number using apache poi?

How to extract font family from OOXML using Apache POI?

Extract text from multiple PDFs using Java

How can I extract images and their metadata from PDFs?

How to extract text from pdfs with pentaho?

How to read the text from shapes in word using Apache poi

How to retrieve watermark text from .docx file using Apache POI?


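Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.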
