
Extracting hyperlinks from .docx using Apache POI

I was extracting content from Word documents with Apache POI when I ran into this problem.

I was using the code below to extract hyperlinks.

 XWPFDocument document = ...
    var it = this.document.getBodyElementsIterator();
    XWPFParagraph para;
    IBodyElement be;
    while(it.hasNext()){
        be=it.next();
        String et = be.getElementType().name();

        System.out.println("element type>>"+et);
        switch (et) {
        case "PARAGRAPH":
            para = (XWPFParagraph) be;
            result.addContent(this.parseParagraph(para));
            break;
        case "TABLE":
......
......

    var runsIt = para.getIRuns().iterator();
    while(runsIt.hasNext()) {

        var irun = runsIt.next();   
        if (irun instanceof XWPFSDT) {
            var fsdt = (XWPFSDT) irun;
            System.out.println("FSDT"+fsdt.toString());
        } else {
            // it is xwpfrun
            var run = (XWPFRun) irun;

            if (irun instanceof XWPFHyperlinkRun) {
                sb.append(extractHyperLink(run));
            }else if(irun instanceof XWPFFieldRun) {
                var fieldRun= (XWPFFieldRun)irun;

                System.out.println("FieldRun:  Instruction>"+fieldRun.getFieldInstruction()+"Text>"+fieldRun.getText(0));
            }
            else {
                sb.append(run);
            } 

This works fine, but then I came across a document where the hyperlinks are not extracted. The XML extract from the relevant section is below:

<w:p w:rsidP="005F1646" w:rsidRDefault="00A20D69" w:rsidR="005F1646">
    <w:pPr>
    <w:r>
        <w:fldChar w:fldCharType="begin"/>
    </w:r>
    <w:r>
        <w:instrText xml:space="preserve"> HYPERLINK "https://stackoverflow.com" </w:instrText>
    </w:r>
    <w:r>
      <w:fldChar w:fldCharType="separate"/>
    </w:r>
    <w:r w:rsidR="005F1646" w:rsidRPr="00D4262C">
       <w:rPr>
          <w:rStyle w:val="Hyperlink"/>
        </w:rPr>
        <w:t>Ask on StackOverFlow</w:t>
    </w:r>
    <w:r>
       <w:rPr>
         <w:rStyle w:val="Hyperlink"/>
       </w:rPr>
       <w:fldChar w:fldCharType="end"/>
     </w:r>
</w:p>

Apache POI does not return the runs in this paragraph as XWPFHyperlinkRun, so my code fails to extract the hyperlink. How can I use Apache POI to extract the hyperlink information in this case?

I faced a similar problem: I had to extract all hyperlinks from my .docx file in order to edit their URLs, and I noticed that I too had a "strangely formatted" link, encoded as

<w:r>
    <w:instrText xml:space="preserve"> HYPERLINK "file//....." </w:instrText>
</w:r>
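
The link target in such an instruction is the quoted string that follows the HYPERLINK keyword. Below is a minimal sketch of pulling it out with a regular expression, using only the JDK. The class and method names are made up for illustration, and it assumes the target is the first quoted argument directly after the keyword, so instructions that start with a switch (for example \l for a bookmark) are deliberately not matched.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI.
class HyperlinkInstruction {
    // HYPERLINK keyword, whitespace, then the first double-quoted argument.
    private static final Pattern HYPERLINK =
            Pattern.compile("^\\s*HYPERLINK\\s+\"([^\"]*)\"");

    /** Returns the quoted target of a HYPERLINK field instruction, if there is one. */
    static Optional<String> target(String instruction) {
        if (instruction == null) {
            return Optional.empty();
        }
        Matcher m = HYPERLINK.matcher(instruction);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String instr = " HYPERLINK \"https://stackoverflow.com\" ";
        System.out.println(target(instr).orElse("(no target)"));
    }
}
```

The string passed to target would be whatever text was read from the w:instrText element.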

While debugging I noticed that this kind of link is an instance of XWPFRun rather than XWPFHyperlinkRun, so I handled it as follows while parsing the document.

import java.util.List;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.IRunElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFHyperlinkRun;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText;

private void traverseToBodyElements(List<IBodyElement> bodyElements, XWPFDocument document) throws Exception {
    for (IBodyElement bodyElement : bodyElements) {
        if (bodyElement instanceof XWPFParagraph) {
            XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
            traverseRunElements(paragraph.getIRuns(), paragraph, document);
        }
    }
}

private void traverseRunElements(List<IRunElement> runElements, XWPFParagraph paragraph, XWPFDocument document) {
    for (IRunElement runElement : runElements) {
        // XWPFHyperlinkRun extends XWPFRun, so it has to be checked first.
        if (runElement instanceof XWPFHyperlinkRun) {
            // handle the hyperlink
        } else if (runElement instanceof XWPFRun) {
            // workaround for .docx hyperlinks stored as <w:instrText> HYPERLINK "..." </w:instrText>
            CTR ctr = ((XWPFRun) runElement).getCTR();
            // every element in this array is a <w:instrText> child of the run
            for (CTText instrText : ctr.getInstrTextArray()) {
                String instruction = instrText.getStringValue();
                if (instruction != null && instruction.contains("HYPERLINK")) {
                    // this is the "strangely" formatted hyperlink, do what you want
                }
            }
        }
    }
}
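
One caveat with checking a single run: in the XML from the question the field spans several runs, with the instruction between the begin and separate markers and the visible text between separate and end. The sketch below walks that structure with the JDK's DOM parser instead of POI, purely to illustrate the state tracking. The class name, the SAMPLE string, and the added xmlns:w declaration are assumptions for the example, and nested fields are not handled.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration, not part of Apache POI.
class FieldHyperlinkScanner {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Trimmed-down copy of the paragraph from the question, with the namespace declared.
    static final String SAMPLE =
            "<w:p xmlns:w=\"" + W + "\">"
          + "<w:r><w:fldChar w:fldCharType=\"begin\"/></w:r>"
          + "<w:r><w:instrText xml:space=\"preserve\"> HYPERLINK \"https://stackoverflow.com\" </w:instrText></w:r>"
          + "<w:r><w:fldChar w:fldCharType=\"separate\"/></w:r>"
          + "<w:r><w:t>Ask on StackOverFlow</w:t></w:r>"
          + "<w:r><w:fldChar w:fldCharType=\"end\"/></w:r>"
          + "</w:p>";

    private enum State { OUTSIDE, INSTRUCTION, RESULT }

    /** Returns one {instruction, displayText} pair per complex field found in the paragraph. */
    static List<String[]> scan(String paragraphXml) {
        Element paragraph;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            paragraph = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse paragraph XML", e);
        }

        List<String[]> fields = new ArrayList<>();
        StringBuilder instruction = new StringBuilder();
        StringBuilder text = new StringBuilder();
        State state = State.OUTSIDE;

        NodeList runs = paragraph.getElementsByTagNameNS(W, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            NodeList children = runs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE || !W.equals(child.getNamespaceURI())) {
                    continue;
                }
                String name = child.getLocalName();
                if ("fldChar".equals(name)) {
                    switch (((Element) child).getAttributeNS(W, "fldCharType")) {
                        case "begin":      // a new field starts: reset both buffers
                            instruction.setLength(0);
                            text.setLength(0);
                            state = State.INSTRUCTION;
                            break;
                        case "separate":   // instruction is complete, display text follows
                            state = State.RESULT;
                            break;
                        case "end":        // field is complete: record it
                            if (state != State.OUTSIDE) {
                                fields.add(new String[] { instruction.toString().trim(), text.toString() });
                            }
                            state = State.OUTSIDE;
                            break;
                        default:
                            break;
                    }
                } else if ("instrText".equals(name) && state == State.INSTRUCTION) {
                    instruction.append(child.getTextContent());
                } else if ("t".equals(name) && state == State.RESULT) {
                    text.append(child.getTextContent());
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String[] field : scan(SAMPLE)) {
            System.out.println(field[0] + " -> " + field[1]);
        }
    }
}
```

The same begin/separate/end bookkeeping could be carried across the runs returned by paragraph.getIRuns(), accumulating instruction text across runs rather than testing each run in isolation.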
