How to extract text and images from a DOC file using Apache HWPF

Question

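I downloaded Apache HWPF. I want to use it to read a doc file and write its text to a plain text file. I don't know HWPF very well.

My very simple program is included below.

I have 3 problems right now:

1. Some packages have errors (they can't find apache hdf). How can I fix them?
2. How can I use the methods of HWPF to find and extract the images?
3. Parts of my program are incomplete and incorrect, so please help me complete it.

I have to finish this program within 2 days.

Once again, please help me complete this.

Thank you very much for your help!

This is my basic code: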

public class test {
  public void m1() {
    String filesname = "Hello.doc";
    POIFSFileSystem fs = null;
    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String str = we.getText();
    String[] paragraphs = we.getParagraphText();
    Picture pic = new Picture(. . .);
    pic.writeImageContent(. . .);
    PicturesTable picTable = new PicturesTable(. . .);
    if (picTable.hasPicture(. . .)) {
      picTable.extractPicture(..., ...);
      picTable.getAllPictures();
    }
  }
}

Answer 1

    // Imports needed by this snippet.
    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.FileInputStream;
    import java.util.List;
    import javax.imageio.ImageIO;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.hwpf.usermodel.Picture;

    // You can use org.apache.poi.hwpf.extractor.WordExtractor to get the text.
    String fileName = "example.doc";
    HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
    WordExtractor extractor = new WordExtractor(wordDoc);
    String[] text = extractor.getParagraphText();
    // Collects the text from the Word document. StringBuilder is used because
    // String.concat() returns a new string instead of modifying articleStr.
    StringBuilder articleStr = new StringBuilder();
    for (int index = 0; index < text.length; ++index) {
        String paragraphStr = text[index].replace("\r\n", "").replace("\n", "").trim();
        if (!paragraphStr.isEmpty()) {
            articleStr.append(paragraphStr);
        }
    }
    // You can use org.apache.poi.hwpf.usermodel.Picture to get the images.
    List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
    for (int i = 0; i < picturesList.size(); ++i) {
        Picture pic = picturesList.get(i);
        // ImageIO.read returns null when no registered reader can decode the bytes.
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
        if (image != null) {
            System.out.println("Image[" + i + "]"
                    + " ImageWidth:" + image.getWidth()
                    + " ImageHeight:" + image.getHeight()
                    + " Suggest Image Format:" + pic.suggestFileExtension());
        }
    }
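
The question also asks for the text to end up in a plain text file and for the images to be pulled out of the document. A minimal sketch of that last step, using only the JDK, could look like the following. It assumes the paragraph array comes from extractor.getParagraphText() and that the image bytes and extension come from pic.getContent() and pic.suggestFileExtension(). The class DocExportSketch, its method names, and the output file names are hypothetical and not part of Apache POI.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Apache POI.
final class DocExportSketch {

    // Strips line terminators, trims each paragraph, drops empty ones,
    // and joins the remaining paragraphs with a single '\n'.
    static String joinParagraphs(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String cleaned = paragraph.replace("\r", "").replace("\n", "").trim();
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return String.join("\n", kept);
    }

    // Writes the text to a UTF-8 encoded plain text file.
    static void writeText(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Writes one image's raw bytes to "<dir>/image-<index>.<extension>"
    // and returns the path of the file that was written.
    static Path saveImage(Path dir, int index, byte[] content, String extension)
            throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve("image-" + index + "." + extension);
        Files.write(target, content);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in values; in real use these would come from WordExtractor and Picture.
        String[] paragraphs = {"First paragraph\r\n", "   ", "Second paragraph\r"};
        Path out = Files.createTempDirectory("doc-export");
        writeText(out.resolve("out.txt"), joinParagraphs(paragraphs));
        Path image = saveImage(out, 0, new byte[] {1, 2, 3}, "png");
        System.out.println("Wrote out.txt and " + image.getFileName() + " to " + out);
    }
}
```

Unlike the loop above, this sketch puts a newline between paragraphs so the text file keeps the paragraph boundaries; drop the separator if one continuous string is preferred.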

Answer 2

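Apache Tika will do this for you. It handles talking to POI to do the HWPF work, and gives you either XHTML or plain text for the contents of the file. If you register a recursing parser, you will also get all the embedded images.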

Answer 3

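I know this is long after the fact, but I found TextMining on Google Code, which I find more accurate and much easier to use. However, it is pretty much abandoned code.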

Answer 4

If you just want to get this done and you don't care about coding it, you can use Antiword.

$ antiword file.doc > out.txt
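
If the conversion still has to be triggered from a Java program, the same command can be launched as an external process. This is only a sketch, assuming antiword is installed and available on the PATH; the class AntiwordSketch and its method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the antiword command line tool.
final class AntiwordSketch {

    // Builds the argument list for "antiword <docPath>".
    static List<String> buildCommand(String docPath) {
        return Arrays.asList("antiword", docPath);
    }

    // Runs antiword and redirects its standard output to outFile,
    // mirroring "antiword file.doc > out.txt". Returns the process exit code.
    static int convert(String docPath, File outFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(docPath));
        builder.redirectOutput(outFile);
        // Let antiword's error messages show up on this program's stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = convert("file.doc", new File("out.txt"));
        System.out.println("antiword exited with code " + exitCode);
    }
}
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.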
