I need to convert Apache POI pictures from a Word document to an HTML file

Question

I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document, but I need to get that image data into the HTML that I write out to an HTML file. Any hints or suggestions would be appreciated. Please keep in mind that I am a desktop developer and not a web programmer when you make suggestions. The code below gets the images.

 private void parseWordText(File file) throws IOException {
      FileInputStream fs = new FileInputStream(file);
      doc = new HWPFDocument(fs);
      PicturesTable picTable = doc.getPicturesTable();
      if (picTable != null) {
           picList = new ArrayList<Picture>(picTable.getAllPictures());
           for (Picture pic : picList) {
                byte[] byteArray = pic.getContent();   // raw image bytes
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
           }
      }
 }

The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?

private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();


    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();

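    // Decode with the same charset the serializer was told to write.
    String result = new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);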
    acDocTextArea.setText(newDocText);

    htmlText = result;

}

Answer 1

Looking at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter at

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740

the JavaDoc states:

This implementation doesn't create images or links to them. This can be changed by overriding {@link #processImage(Element, boolean, Picture)} method

If you take a look at that processImage(...) method in AbstractWordConverter.java at line 790, it looks like the method then calls another method named processImageWithoutPicturesManager(...).

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740

That method is overridden in WordToHtmlConverter and looks suspiciously like exactly the place where you want to add your code (line 317):

@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
    boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
    .createComment("Image link to '"
    + picture.suggestFullFileName() + "' can be here"));
}

I think this is the point where you can start inserting the images into the flow.

Create a subclass of the converter, e.g.

    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter

and then override the method and place whatever code you need into it.

I haven't tested it, but from what I can see it should be the right approach in theory.
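To make "place whatever code you need into it" a bit more concrete, here is a minimal sketch of the DOM side only, without POI: it appends an img element to a block element and serializes the document with the same Transformer settings as in the question. The class name ImgElementSketch, the helper names, and the path imgs/picture1.png are made up for illustration; inside a real override the block element would be the currentBlock parameter.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ImgElementSketch {

    // Appends <img src="..."> to the given block element. This is the kind of
    // work an overridden processImageWithoutPicturesManager(...) could do.
    static void appendImage(Element currentBlock, String src) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        currentBlock.appendChild(img);
    }

    // Serializes the DOM as HTML text.
    static String toHtml(Document document) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter writer = new StringWriter();
        serializer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    // Builds a tiny html/body/p document that stands in for the converter output.
    static Document sampleDocument() throws Exception {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = document.createElement("html");
        Element body = document.createElement("body");
        Element paragraph = document.createElement("p");
        document.appendChild(html);
        html.appendChild(body);
        body.appendChild(paragraph);
        appendImage(paragraph, "imgs/picture1.png"); // hypothetical relative path
        return document;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(sampleDocument()));
    }
}
```

The point for the question is that the image bytes are not appended to the ByteArrayOutputStream; the HTML only carries an img element whose src refers to the image.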

Answer 2

@user4887078 It's straightforward, just as @Guga said. All I did was look at org.apache.poi.xwpf.converter.core.FileImageExtractor, and voilà! It works as expected, although it might still need some refactoring and optimization.

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(is);

            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager(new PicturesManager() {
                @Override
                public String savePicture(byte[] bytes, PictureType pictureType, String s, float v, float v1) {
                    File imageFile = new File("pages/imgs", s);
                    imageFile.getParentFile().mkdirs();
                    InputStream in = null;
                    FileOutputStream out = null;

                    try {
                        in = new ByteArrayInputStream(bytes);
                        out = new FileOutputStream(imageFile);
                        IOUtils.copy(in, out);

                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } catch (IOException e) {
                        e.printStackTrace();
                    } finally {
                        if (in != null) {
                            IOUtils.closeQuietly(in);
                        }

                        if (out != null) {
                            IOUtils.closeQuietly(out);
                        }

                    }
                    return "imgs/" + imageFile.getName();
                }
            });
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);


            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(domSource, streamResult);
            out.close();

            // Write the serialized HTML to the output file (outFile is defined elsewhere).
            try (FileOutputStream fos = new FileOutputStream(outFile)) {
                fos.write(out.toByteArray());
            }
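The file handling inside savePicture can probably be written more compactly with java.nio. The sketch below is only the file-saving part, with no POI types involved; ImageFileSketch, saveImage and the htmlDir parameter are hypothetical names, and it assumes the images should end up in an imgs folder next to the HTML file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ImageFileSketch {

    // Writes the image bytes to <htmlDir>/imgs/<fileName> and returns the
    // relative path to put into the img src attribute.
    static String saveImage(Path htmlDir, String fileName, byte[] bytes) throws IOException {
        Path imageDir = htmlDir.resolve("imgs");
        Files.createDirectories(imageDir);
        // Keep only the last path segment so a suggested name cannot leave imgs/.
        String safeName = Paths.get(fileName).getFileName().toString();
        Files.write(imageDir.resolve(safeName), bytes);
        return "imgs/" + safeName;
    }

    public static void main(String[] args) throws IOException {
        Path htmlDir = Files.createTempDirectory("pages");
        String src = saveImage(htmlDir, "picture1.png", new byte[] {1, 2, 3});
        System.out.println(src + " -> " + htmlDir.resolve(src));
    }
}
```

A savePicture implementation could call such a helper with the bytes and suggested name it receives and return the resulting string, as long as the HTML file itself is written into htmlDir so the relative link resolves.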

Answer 3

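This should be useful.

import java.util.Base64;

import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {
    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        // Embed the picture directly in the HTML as a Base64 data URI.
        Element img = getDocument().createElement("img");
        // Use the picture's own MIME type rather than assuming PNG.
        img.setAttribute("src", "data:" + picture.getMimeType() + ";base64,"
                + Base64.getEncoder().encodeToString(picture.getContent()));
        currentBlock.appendChild(img);
    }
}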
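
A data URI names a MIME type in front of the Base64 payload, and Word documents often contain JPEG or GIF pictures as well as PNG. If you only have the raw bytes, such as the byteArray from the question, a rough sketch like the following could guess the type from the leading bytes. DataUriSketch and its methods are made-up names, and only three common formats are recognized; everything else falls back to a generic type.

```java
import java.util.Base64;

class DataUriSketch {

    // Guesses a MIME type from the leading "magic" bytes of the image data.
    static String guessMimeType(byte[] data) {
        if (startsWith(data, 0x89, 'P', 'N', 'G')) return "image/png";
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (startsWith(data, 'G', 'I', 'F', '8')) return "image/gif";
        return "application/octet-stream"; // unknown format
    }

    // Compares the first bytes of data (as unsigned values) with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Builds a data URI that can be used directly as an img src value.
    static String toDataUri(byte[] data) {
        return "data:" + guessMimeType(data) + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(toDataUri(jpegHeader));
    }
}
```

Embedding images this way keeps everything in a single HTML file, at the cost of a larger file, since Base64 adds roughly a third to the size of the image data.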
