
How do I use Apache POI to read a .DOC file in Java to separate images from text?

Original 2009-02-28 05:41:40 8 2 java/ ms-word/ apache-poi

I need to read a Word .doc file from Java that has text and images. I need to recognize the images & text and separate them into 2 files.

I've recently heard about "Apache POI." How can I use Apache POI to read Word .doc files?

2 answers

The examples and sample code on Apache's site are pretty good. I recommend you start there.

http://poi.apache.org/hwpf/quick-guide.html

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get the paragraphs from that range. You can then read each paragraph's text and other properties.
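As a rough sketch of the steps just described (not the code from the quick guide), something like the following should dump the paragraph text to a file. The file names input.doc and text.txt, the class name, and the trimParagraphMark helper are my own placeholders, and it assumes the POI core and scratchpad (HWPF) jars are on the classpath.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;

public class DocTextDump {

    // Hypothetical helper: paragraph text from HWPF usually ends with a
    // paragraph mark ('\r') or, inside tables, a cell mark ('\u0007').
    static String trimParagraphMark(String s) {
        int end = s.length();
        while (end > 0) {
            char c = s.charAt(end - 1);
            if (c == '\r' || c == '\n' || c == '\u0007') {
                end--;
            } else {
                break;
            }
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) throws IOException {
        // "input.doc" and "text.txt" are placeholder names.
        HWPFDocument doc;
        try (InputStream in = new FileInputStream("input.doc")) {
            doc = new HWPFDocument(in);
        }

        Range range = doc.getRange();
        try (PrintWriter out = new PrintWriter("text.txt", "UTF-8")) {
            // Walk the paragraphs of the main document range in order.
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);
                out.println(trimParagraphMark(p.text()));
            }
        }
    }
}
```

The raw text can still contain field codes and other control characters, so expect to do some extra cleanup for real documents.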

See here for an example of extracting an image, and here for the latest revision as of this writing.
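For the image half, here is a minimal sketch that uses the document's pictures table. This is not the linked example, just my own illustration; the input file name, the class name, and the image-N naming scheme (via the hypothetical fileNameFor helper) are assumptions.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

public class DocImageDump {

    // Hypothetical naming scheme: image-0.png, image-1.jpg, ...
    // Falls back to ".bin" when POI cannot suggest an extension.
    static String fileNameFor(int index, String extension) {
        String ext = (extension == null || extension.isEmpty()) ? "bin" : extension;
        return "image-" + index + "." + ext;
    }

    public static void main(String[] args) throws IOException {
        // "input.doc" is a placeholder name.
        HWPFDocument doc;
        try (InputStream in = new FileInputStream("input.doc")) {
            doc = new HWPFDocument(in);
        }

        // The pictures table lists the pictures embedded in the document.
        List<Picture> pictures = doc.getPicturesTable().getAllPictures();
        for (int i = 0; i < pictures.size(); i++) {
            Picture pic = pictures.get(i);
            String name = fileNameFor(i, pic.suggestFileExtension());
            try (OutputStream out = new FileOutputStream(name)) {
                pic.writeImageContent(out); // writes the image bytes as stored
            }
        }
    }
}
```

Writing each picture to its own file keeps the images fully separate from the text output; if you need a single file for all images you would have to pick a container format yourself.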

And of course, the Javadocs.

Note that, according to the POI site,

HWPF is still in early development.

It's not free (or even cheap!) but Aspose.Words should be able to do this. Their evaluation download will let you play with small files.

Do the destination files also have to be Docs? You could open the docs in Office and save them out as HTML. Then the separation becomes trivial. RTF is also a viable option, but I can't recommend a good RTF parser off the top of my head.
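To show what I mean by the HTML route, here is a crude sketch using only the JDK. It assumes the saved HTML uses quoted src attributes on plain img tags (Word's export may also emit other image markup that this ignores), and the class and method names are made up for illustration. A real HTML parser would be more robust than these regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSplit {

    // Matches the quoted src attribute of an <img> tag (assumption: quoted values).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Collects the image references in document order.
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    // Drops script/style blocks and all tags, then collapses whitespace.
    // Entities such as &amp; are left undecoded in this sketch.
    static String plainText(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p>"
                + "<img src=\"doc_files/image001.png\"></body></html>";
        System.out.println(plainText(html));
        System.out.println(imageSources(html));
    }
}
```

The image list gives you the files to copy into the "images" output, and the stripped text goes into the "text" output.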

Edit to say: I just remembered another possible solution: Jacob, but you'll need an instance of Office running on the same machine. The name is short for Java COM Bridge, and it lets you make calls to the COM libraries in Office to manipulate the documents. I'm sure it's not as scary as it might sound!

How can I extract right-to-left text from .doc and .docx files using Apache POI in java?

Enter text to a Table Cell in a Doc file using apache poi in java

How to use Apache HWPF to extract text and images out of a DOC file

How to read doc file for the first four paragraph using Apache POI?

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

How to extract text from .doc document using apache poi?

How to read doc and docx file in java with POI api

how to know whether a file is .docx or .doc format from Apache POI

Apache POI :- Get Headings from DOC file

How to decrypt a .doc/docx file with Apache POI?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
13

solution2
1 2009-02-28 07:34:22
