How do I get drawings from an Apache POI XWPFDocument?

Question

I am trying to get the drawings from an XWPFDocument this way (my data.docx contains only a single rectangle, which holds some text).

    XWPFDocument wordDocumentObj = new XWPFDocument(new FileInputStream(new File("data.docx")));
    Iterator<IBodyElement> bodyElementIterator = wordDocumentObj.getBodyElementsIterator();

    while(bodyElementIterator.hasNext()){
        IBodyElement element = bodyElementIterator.next();
        if (element instanceof XWPFParagraph) {
             XWPFParagraph paragrapObj = (XWPFParagraph)element;
             for(IRunElement irunObj : paragrapObj.getIRuns()) {
                 XWPFRun runObj = (XWPFRun)irunObj;
                 // I read the whole API doc; I think this is the only way to get the drawings
                 System.out.println(runObj.getCTR().getDrawingList());// No element returned
                 System.out.println(runObj.getCTR().getDrawingArray());// No element returned
             }
        }
    }

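Do you have any idea how to get the drawings from an XWPFDocument?

Update: here is the XML content of the XWPFRun. I also tried unzipping the Word file; there are no images under the /word/* directory: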


<xml-fragment >
   <mc:AlternateContent>
      <mc:Choice Requires="wps">
         <w:drawing>
            <wp:anchor>
               <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
                  <a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
                     <wps:wsp>
                        <wps:txbx>
                           <w:txbxContent>
                              <w:p w14:paraId="2744738E" w14:textId="0811E43C" w:rsidR="00832A19" w:rsidRDefault="00832A19" w:rsidP="00832A19">
                                 <w:r>
                                    <w:t>Some text here</w:t>
                                 </w:r>
                              </w:p>
                           </w:txbxContent>
                        </wps:txbx>

                     </wps:wsp>
                  </a:graphicData>
               </a:graphic>
            </wp:anchor>
         </w:drawing>
      </mc:Choice>
      <mc:Fallback>
         <w:pict>
            <v:rect w14:anchorId="684D682E" id="Rectangle 2" o:spid="_x0000_s1026" style="" fillcolor="#4f81bd [3204]" strokecolor="#243f60 [1604]" strokeweight="2pt">
               <v:textbox>
                  <w:txbxContent>
                     <w:p w14:paraId="2744738E" w14:textId="0811E43C" w:rsidR="00832A19" w:rsidRDefault="00832A19" w:rsidP="00832A19">
                        <w:r>
                           <w:t>Some text here</w:t>
                        </w:r>
                     </w:p>
                  </w:txbxContent>
               </v:textbox>
            </v:rect>
         </w:pict>
      </mc:Fallback>
   </mc:AlternateContent>
</xml-fragment>

Answer 1

The XML you provided shows that your Word document uses alternate content, which was introduced after Office Open XML was published in 2007. Apache POI therefore does not provide methods to get that content, as it only provides methods for Office Open XML according to the standard ECMA-376. This is because the underlying ooxml-schemas were created from that ECMA-376 standard only.

So the drawing element inside the AlternateContent element can only be obtained directly using XML (XPath) methods.

This could look like this:

import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;

import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;

import java.util.List;
import java.util.ArrayList;

public class WordGetAllDrawingsFromRuns {

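 // Collects the w:drawing elements matched by the XPath below, such as those nested in mc:AlternateContent.
 private static List<CTDrawing> getAllDrawings(XWPFRun run) throws Exception {
  CTR ctR = run.getCTR();
  XmlCursor cursor = ctR.newCursor();
  // Select by XPath because the typed CTR getters only see direct w:drawing children.
  cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:drawing");
  List<CTDrawing> drawings = new ArrayList<CTDrawing>();
  while (cursor.hasNextSelection()) {
   cursor.toNextSelection();
   XmlObject obj = cursor.getObject();
   // Re-parse the selected XML into a typed CTDrawing.
   CTDrawing drawing = CTDrawing.Factory.parse(obj.newInputStream());
   drawings.add(drawing);
  }
  cursor.dispose(); // release the cursor once the selections have been read
  return drawings;
 }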

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocument.docx"));

  for (IBodyElement bodyElement : document.getBodyElements()) {
   if (bodyElement instanceof XWPFParagraph) {
    XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
    for(IRunElement runElement : paragraph.getIRuns()) {
     if (runElement instanceof XWPFRun) {
      XWPFRun run = (XWPFRun) runElement;
      List<CTDrawing> drawings = getAllDrawings(run);
      System.out.println(drawings);

     }
    }
   }
  }

  document.close();
 }
}
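
Why `getDrawingList()` comes back empty can be reproduced without POI. The following is only a minimal sketch that uses the JDK's `javax.xml` XPath API instead of XmlBeans; the class name `DrawingXPathDemo`, the helper `count`, and the trimmed-down sample XML are assumptions made for illustration and are not part of the answer's code. It shows that `w:drawing` is not a direct child of `w:r` when it is wrapped in `mc:AlternateContent`, while a descendant XPath still finds it.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DrawingXPathDemo {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Hypothetical, heavily simplified run modelled on the XML in the question.
    static final String RUN_XML =
          "<w:r xmlns:w='" + W + "' xmlns:mc='" + MC + "'>"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires='wps'><w:drawing/></mc:Choice>"
        + "<mc:Fallback><w:pict/></mc:Fallback>"
        + "</mc:AlternateContent>"
        + "</w:r>";

    // Counts the nodes that the XPath expression selects, relative to the root element.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("w".equals(prefix)) return W;
                    if ("mc".equals(prefix)) return MC;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc.getDocumentElement(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Direct children of w:r only, which is what a typed child getter looks at.
        System.out.println(count(RUN_XML, "w:drawing"));
        // Any depth below w:r, including inside mc:AlternateContent.
        System.out.println(count(RUN_XML, ".//w:drawing"));
    }
}
```

With this sample the first expression selects nothing and the second selects the one nested drawing, which mirrors the difference between the typed getters and the cursor-based selection.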

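The next problem, however, will be how to get the contents out of the drawing element, because <wps:wsp><wps:txbx> is also not part of Office Open XML according to the standard ECMA-376. So the ooxml-schemas methods of CTDrawing cannot get these either. Therefore, if you need the text box content from the drawing, it too can only be obtained directly using XML (XPath) methods.

This could look like this: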

 private static CTTxbxContent getTextBoxContent(CTDrawing drawing) throws Exception {
  XmlCursor cursor = drawing.newCursor();
  cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent");
  List<CTTxbxContent> txbxContents = new ArrayList<CTTxbxContent>();
  while (cursor.hasNextSelection()) {
   cursor.toNextSelection();
   XmlObject obj = cursor.getObject();
   CTTxbxContent txbxContent = CTTxbxContent.Factory.parse(obj.newInputStream());
   txbxContents.add(txbxContent);
   break;
  }
  CTTxbxContent txbxContent = null;
  if (txbxContents.size() > 0) {
   txbxContent = txbxContents.get(0);
  }
  return txbxContent;
 }
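
Once the text box content has been located, the text itself sits in its `w:p`/`w:r`/`w:t` elements. As a rough sketch of that last step, again using only the JDK XPath API rather than XmlBeans (the class `TextBoxTextDemo`, the method `texts`, and the simplified sample XML are assumptions for illustration), the example below reads the `w:t` values. It also shows why starting from the drawing, or restricting the path to `mc:Choice`, matters: the `mc:Fallback` branch in the question's XML carries a second copy of the same text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextBoxTextDemo {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Hypothetical sample: the question's run with the intermediate elements left out.
    static final String RUN_XML =
          "<w:r xmlns:w='" + W + "' xmlns:mc='" + MC + "'>"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires='wps'><w:drawing><w:txbxContent>"
        + "<w:p><w:r><w:t>Some text here</w:t></w:r></w:p>"
        + "</w:txbxContent></w:drawing></mc:Choice>"
        + "<mc:Fallback><w:pict><w:txbxContent>"
        + "<w:p><w:r><w:t>Some text here</w:t></w:r></w:p>"
        + "</w:txbxContent></w:pict></mc:Fallback>"
        + "</mc:AlternateContent>"
        + "</w:r>";

    // Returns the text of every node selected by the expression, in document order.
    static List<String> texts(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("w".equals(prefix)) return W;
                    if ("mc".equals(prefix)) return MC;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc.getDocumentElement(), XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Only the text box inside the preferred (Choice) branch.
        System.out.println(texts(RUN_XML, ".//mc:Choice//w:txbxContent//w:t"));
        // Every text box below the run, which also picks up the Fallback copy.
        System.out.println(texts(RUN_XML, ".//w:txbxContent//w:t"));
    }
}
```

In the answer's code the same effect comes from calling `getTextBoxContent` on a `CTDrawing`, since the drawing exists only inside the `mc:Choice` branch.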

Solution 1
3, accepted, 2020-05-06 10:55:39
