
Apache POI library: How to read an Excel sheet embedded in a Word document

I am using the Apache POI library to read a Word document and convert it to HTML. The document includes an embedded Excel worksheet. Is there a way to read that embedded sheet while reading the XWPF document?

The OOXML contains the following markup:

<w:object w:dxaOrig="6942" w:dyaOrig="3234" w14:anchorId="071813E3">
                <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
                  <v:stroke joinstyle="miter"/>
                  <v:formulas>
                    <v:f eqn="if lineDrawn pixelLineWidth 0"/>
                    <v:f eqn="sum @0 1 0"/>
                    <v:f eqn="sum 0 0 @1"/>
                    <v:f eqn="prod @2 1 2"/>
                    <v:f eqn="prod @3 21600 pixelWidth"/>
                    <v:f eqn="prod @3 21600 pixelHeight"/>
                    <v:f eqn="sum @0 0 1"/>
                    <v:f eqn="prod @6 1 2"/>
                    <v:f eqn="prod @7 21600 pixelWidth"/>
                    <v:f eqn="sum @8 21600 0"/>
                    <v:f eqn="prod @7 21600 pixelHeight"/>
                    <v:f eqn="sum @10 21600 0"/>
                  </v:formulas>
                  <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
                  <o:lock v:ext="edit" aspectratio="t"/>
                </v:shapetype>
                <v:shape id="_x0000_i1037" type="#_x0000_t75" style="width:347.4pt;height:162pt" o:ole="">
                  <v:imagedata r:id="rId7" o:title=""/>
                </v:shape>
                <o:OLEObject Type="Embed" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1037" DrawAspect="Content" ObjectID="_1653752874" r:id="rId8"/>
              </w:object>

I can see that an OLEObject is embedded there, but I am not sure how to read its contents. Any help is greatly appreciated.

OLEObject elements are contained in XWPFRuns, so one can check each XWPFRun for OLEObject elements. If one is found, read the r:id attribute of the OLEObject. This ID refers to a relationship that links to a document part of the Office Open XML package. The content type of the package part behind that document part determines what kind of object is embedded. Depending on that content type, one can then open the part as an XSSFWorkbook or an HSSFWorkbook, or handle other kinds of embedded objects.
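To make the link from relationship ID to content type concrete, here is a minimal sketch that uses only JDK XML classes, not POI. The two XML strings are hand-written, hypothetical stand-ins for word/_rels/document.xml.rels and [Content_Types].xml. The target name embeddings/Microsoft_Excel_Worksheet.xlsx and all class and method names are assumptions for illustration. The sketch only looks at Default entries of the content types and ignores Override entries, which a real package may also use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelationshipLookupSketch {

    static final String RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types";

    // Hypothetical stand-in for word/_rels/document.xml.rels.
    static final String RELS =
        "<Relationships xmlns='" + RELS_NS + "'>"
      + "<Relationship Id='rId7' Target='media/image1.emf'"
      + " Type='http://schemas.openxmlformats.org/officeDocument/2006/relationships/image'/>"
      + "<Relationship Id='rId8' Target='embeddings/Microsoft_Excel_Worksheet.xlsx'"
      + " Type='http://schemas.openxmlformats.org/officeDocument/2006/relationships/package'/>"
      + "</Relationships>";

    // Hypothetical stand-in for [Content_Types].xml (Default entries only).
    static final String CONTENT_TYPES =
        "<Types xmlns='" + TYPES_NS + "'>"
      + "<Default Extension='emf' ContentType='image/x-emf'/>"
      + "<Default Extension='xlsx'"
      + " ContentType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'/>"
      + "</Types>";

    // Finds the first element whose attribute `key` equals `value` and
    // returns its attribute `wanted`, or null if there is no such element.
    static String lookup(String xml, String ns, String element,
                         String key, String value, String wanted) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            NodeList nodes = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagNameNS(ns, element);
            for (int i = 0; i < nodes.getLength(); i++) {
                Element candidate = (Element) nodes.item(i);
                if (value.equals(candidate.getAttribute(key))) {
                    return candidate.getAttribute(wanted);
                }
            }
            return null;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Step 1: relationship ID -> target part name.
    static String targetOf(String rId) {
        return lookup(RELS, RELS_NS, "Relationship", "Id", rId, "Target");
    }

    // Step 2: target part name -> content type, by file extension.
    static String contentTypeOf(String target) {
        String extension = target.substring(target.lastIndexOf('.') + 1);
        return lookup(CONTENT_TYPES, TYPES_NS, "Default", "Extension", extension, "ContentType");
    }

    public static void main(String[] args) {
        String target = targetOf("rId8");
        System.out.println(target + " -> " + contentTypeOf(target));
    }
}
```

POI performs this resolution internally, so with POI it is enough to ask the document for the part by its relationship ID and then read the content type from the package part.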

The following methods demonstrate this approach with Apache POI:

...
import org.apache.poi.ooxml.*;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.hssf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlObject;
...

 void handleOLEObjects(XWPFRun run) {
  CTR ctr = run.getCTR();
  // Select all o:OLEObject elements below this run's XML via XPath.
  String declareNameSpaces = "declare namespace o='urn:schemas-microsoft-com:office:office'";
  XmlObject[] oleObjects = ctr.selectPath(declareNameSpaces + ".//o:OLEObject");
  for (XmlObject oleObject : oleObjects) {
   // The r:id attribute holds the relationship ID of the embedded part.
   XmlObject rIdAttribute = oleObject.selectAttribute("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id");
   if (rIdAttribute != null) {
    String rId = rIdAttribute.newCursor().getTextValue();
    handleOLEObject(run.getDocument(), rId);
   }
  }
 }

 void handleOLEObject(XWPFDocument document, String rId) {
  // Resolve the relationship ID to the embedded document part; skip IDs that do not resolve.
  POIXMLDocumentPart documentPart = document.getRelationById(rId);
  if (documentPart == null) return;
  PackagePart part = documentPart.getPackagePart();
  String contentType = part.getContentType();
  if ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet".equals(contentType)) {
   handleXSSFWorkbook(part); // embedded *.xlsx
  } else if ("application/vnd.ms-excel".equals(contentType)) {
   handleHSSFWorkbook(part); // embedded *.xls
  } //else if ...
 }

 void handleXSSFWorkbook(PackagePart part) {
  // try-with-resources closes the embedded workbook after printing its cells
  try (XSSFWorkbook workbook = new XSSFWorkbook(part)) {
   for (Sheet sheet : workbook) {
    for (Row row : sheet) {
     for (Cell cell : row) {
      System.out.print(cell + "\t");
     }
     System.out.println();
    }
   }
  } catch (Exception ex) {
   ex.printStackTrace();
  }
 }

 void handleHSSFWorkbook(PackagePart part) {
  // try-with-resources closes the embedded workbook after printing its cells
  try (HSSFWorkbook workbook = new HSSFWorkbook(part.getInputStream())) {
   for (Sheet sheet : workbook) {
    for (Row row : sheet) {
     for (Cell cell : row) {
      System.out.print(cell + "\t");
     }
     System.out.println();
    }
   }
  } catch (Exception ex) {
   ex.printStackTrace();
  }
 }

The method handleOLEObjects uses XPath to get all OLEObject XML elements out of the XWPFRun and reads the r:id attribute of each. If that attribute is present, it calls handleOLEObject. This method gets the linked POIXMLDocumentPart from the XWPFDocument by the rId. It then uses the content type of the part to determine which kind of object is embedded and calls the matching handler method.
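The XPath step can also be tried in isolation, without POI or XMLBeans. The following is a minimal sketch that uses the JDK's javax.xml.xpath API on a shortened copy of the markup from the question. The namespace declarations on w:object are added here so that the fragment parses on its own, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class OleObjectXPathSketch {

    static final String O_NS = "urn:schemas-microsoft-com:office:office";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Shortened copy of the question's markup; xmlns declarations added for this sketch.
    static final String FRAGMENT =
        "<w:object xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " xmlns:v='urn:schemas-microsoft-com:vml'"
      + " xmlns:o='" + O_NS + "' xmlns:r='" + R_NS + "'"
      + " w:dxaOrig='6942' w:dyaOrig='3234'>"
      + "<v:shape id='_x0000_i1037' type='#_x0000_t75' o:ole=''>"
      + "<v:imagedata r:id='rId7' o:title=''/>"
      + "</v:shape>"
      + "<o:OLEObject Type='Embed' ProgID='Excel.Sheet.12' ShapeID='_x0000_i1037'"
      + " DrawAspect='Content' ObjectID='_1653752874' r:id='rId8'/>"
      + "</w:object>";

    // Evaluates an XPath expression against the XML and returns the result as a string.
    // The prefixes o and r are bound to the same namespaces the POI code uses.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    if ("o".equals(prefix)) return O_NS;
                    if ("r".equals(prefix)) return R_NS;
                    return XMLConstants.NULL_NS_URI;
                }
                // Reverse lookups are not needed for evaluating expressions here.
                @Override public String getPrefix(String namespaceURI) { return null; }
                @Override public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return xpath.evaluate(expression, document);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(FRAGMENT, "//o:OLEObject/@r:id"));   // rId8
        System.out.println(evaluate(FRAGMENT, "//o:OLEObject/@ProgID")); // Excel.Sheet.12
    }
}
```

The v:imagedata element also carries an r:id (rId7), but that one points to the preview image of the embedded object. Selecting through o:OLEObject keeps the lookup on the embedded object itself.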
