How do I use Apache POI to read a .DOC file in Java to separate images from text?

I need to read a Word .doc file from Java that contains both text and images. I need to identify the images and the text and separate them into two files.

I've recently heard about Apache POI. How can I use Apache POI to read Word .doc files?

The examples and sample code on Apache's site are pretty good. I recommend you start there.

http://poi.apache.org/hwpf/quick-guide.html

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.
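The steps above can be sketched roughly as follows. This is a minimal sketch, not an official POI sample: it assumes the poi and poi-scratchpad jars are on the classpath, and the class name DocTextDump and the default file names input.doc and text.txt are placeholders made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Hypothetical helper class: writes each paragraph of a .doc file to a text file.
public class DocTextDump {
    public static void main(String[] args) throws IOException {
        // Placeholder paths; pass real ones as arguments.
        String inputPath = args.length > 0 ? args[0] : "input.doc";
        String outputPath = args.length > 1 ? args[1] : "text.txt";

        // HWPFDocument loads the whole file from the stream.
        HWPFDocument doc;
        try (InputStream in = new FileInputStream(inputPath)) {
            doc = new HWPFDocument(in);
        }

        // The overall range covers the main document text.
        Range range = doc.getRange();

        try (PrintWriter out = new PrintWriter(outputPath, "UTF-8")) {
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph paragraph = range.getParagraph(i);
                // Raw paragraph text usually ends with a paragraph mark,
                // so trim it before writing one paragraph per line.
                out.println(paragraph.text().trim());
            }
        }
    }
}
```

The raw paragraph text can still contain control characters (for example, field codes or placeholders for embedded objects), so some extra cleanup may be needed depending on the document.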

See here for an example of extracting an image, and here for the latest revision as of this writing.
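For the image half of the problem, something along these lines may work. It is a sketch under assumptions: the poi and poi-scratchpad jars are on the classpath, and the class name DocImageDump, the input path input.doc, the output directory images, and the image-N naming scheme are all invented for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;

// Hypothetical helper class: saves every picture found in a .doc file to a directory.
public class DocImageDump {
    public static void main(String[] args) throws IOException {
        // Placeholder paths; pass real ones as arguments.
        String inputPath = args.length > 0 ? args[0] : "input.doc";
        Path outputDir = Paths.get(args.length > 1 ? args[1] : "images");
        Files.createDirectories(outputDir);

        HWPFDocument doc;
        try (InputStream in = new FileInputStream(inputPath)) {
            doc = new HWPFDocument(in);
        }

        // The pictures table lists the pictures stored in the document.
        List<Picture> pictures = doc.getPicturesTable().getAllPictures();

        int index = 0;
        for (Picture picture : pictures) {
            // POI suggests an extension based on the picture data; it may be empty.
            String extension = picture.suggestFileExtension();
            String fileName = "image-" + index + (extension.isEmpty() ? "" : "." + extension);
            index++;

            try (OutputStream out = Files.newOutputStream(outputDir.resolve(fileName))) {
                picture.writeImageContent(out);
            }
        }
    }
}
```

Run together with a text dump, this gives the two separate outputs the question asks for: one text file and one folder of image files.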

And of course, the Javadocs.

Note that, according to the POI site, "HWPF is still in early development."

It's not free (or even cheap!), but Aspose.Words should be able to do this. Their evaluation download will let you experiment with small files.

Do the destination files also have to be .doc files? You could open the documents in Office and save them as HTML. Then the separation becomes trivial. RTF is also a viable option, but I can't recommend a good RTF parser off the top of my head.
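To illustrate why the HTML route makes the split easy, here is a rough sketch using only the Java standard library. It assumes the saved HTML references its images through quoted src attributes on img tags. The class name HtmlSplitter and the sample markup are made up, and the regular expressions are a simplification (no entity decoding, no script or style handling), not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: splits saved HTML into image references and plain text.
public class HtmlSplitter {
    // Matches <img ... src="..."> and captures the quoted src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Matches any tag; used to strip markup and leave only the text.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the src value of every img tag, in document order.
    public static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            sources.add(matcher.group(1));
        }
        return sources;
    }

    // Replaces tags with spaces, then collapses runs of whitespace.
    public static String plainText(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p>"
                + "<img src=\"doc_files/image001.png\"></body></html>";
        System.out.println(imageSources(html)); // [doc_files/image001.png]
        System.out.println(plainText(html));    // Hello world
    }
}
```

The image paths returned here point at the files Office writes alongside the HTML, so the images themselves are already separate files on disk. For anything beyond a quick experiment, a proper HTML parser would be a safer choice than regular expressions.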

Edit to say: I just remembered another possible solution: Jacob, but you'll need an instance of Office running on the same machine. It's short for Java COM Bridge, and it lets you make calls to the COM libraries in Office to manipulate the documents. I'm sure it's not as scary as it might sound!

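Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.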

 