I was extracting content from Word documents with Apache POI when I ran into this problem.
I was using the code below to extract hyperlinks.
XWPFDocument document = ...
var it = this.document.getBodyElementsIterator();
XWPFParagraph para;
IBodyElement be;
while (it.hasNext()) {
    be = it.next();
    String et = be.getElementType().name();
    System.out.println("element type>>" + et);
    switch (et) {
        case "PARAGRAPH":
            para = (XWPFParagraph) be;
            result.addContent(this.parseParagraph(para));
            break;
        case "TABLE":
......
......
var runsIt = para.getIRuns().iterator();
while (runsIt.hasNext()) {
    var irun = runsIt.next();
    if (irun instanceof XWPFSDT) {
        var fsdt = (XWPFSDT) irun;
        System.out.println("FSDT" + fsdt.toString());
    } else {
        // it is an XWPFRun
        var run = (XWPFRun) irun;
        if (irun instanceof XWPFHyperlinkRun) {
            sb.append(extractHyperLink(run));
        } else if (irun instanceof XWPFFieldRun) {
            var fieldRun = (XWPFFieldRun) irun;
            System.out.println("FieldRun: Instruction>" + fieldRun.getFieldInstruction() + "Text>" + fieldRun.getText(0));
        } else {
            sb.append(run);
        }
This works fine, but then I came across a document where the hyperlinks are not extracted. The XML extract from the relevant section is below:
<w:p w:rsidP="005F1646" w:rsidRDefault="00A20D69" w:rsidR="005F1646">
<w:pPr>
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> HYPERLINK "https://stackoverflow.com" </w:instrText>
</w:r>
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="005F1646" w:rsidRPr="00D4262C">
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
</w:rPr>
<w:t>Ask on StackOverFlow</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:p>
Apache POI does not extract the runs in this paragraph as XWPFHyperlinkRun and my code fails to extract the hyperlink. How can I use Apache POI to extract hyperlink information in this case?
I ran into a similar problem: I had to extract every hyperlink from a .docx file in order to edit its URL, and I noticed that I too had a "strangely formatted" link, encoded as
<w:r>
<w:instrText xml:space="preserve"> HYPERLINK "file//....." </w:instrText>
</w:r>
While debugging, I noticed that this kind of link is an instance of XWPFRun rather than XWPFHyperlinkRun, so I did the following to handle it while parsing the document.
import java.util.List;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText;
import org.w3c.dom.Node;

private void traverseToBodyElements(List<IBodyElement> bodyElements, XWPFDocument document) throws Exception {
    for (IBodyElement bodyElement : bodyElements) {
        if (bodyElement instanceof XWPFParagraph) {
            XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
            traverseRunElements(paragraph.getIRuns(), paragraph, document);
        }
    }
}

private void traverseRunElements(List<IRunElement> runElements, XWPFParagraph paragraph, XWPFDocument document) {
    for (int rIndex = 0; rIndex < runElements.size(); rIndex++) {
        IRunElement runElement = runElements.get(rIndex);
        if (runElement instanceof XWPFHyperlinkRun) {
            // handle the hyperlink
        } else if (runElement instanceof XWPFRun) {
            // workaround for field-based hyperlinks stored as <w:instrText> HYPERLINK "..." </w:instrText>
            XWPFRun run = (XWPFRun) runElement;
            CTR ctr = run.getCTR();
            CTText[] ctrInstrTextArray = ctr.getInstrTextArray();
            if (ctrInstrTextArray.length > 0) {
                // walk the direct children of the run and look for w:instrText
                XmlCursor c = ctr.newCursor();
                c.selectPath("./*");
                while (c.toNextSelection()) {
                    XmlObject o = c.getObject();
                    if (o instanceof CTText) {
                        String tagName = o.getDomNode().getNodeName();
                        if ("w:instrText".equals(tagName)) {
                            Node node = o.getDomNode();
                            int childLength = node.getChildNodes().getLength();
                            for (int nodeIndex = 0; nodeIndex < childLength; nodeIndex++) {
                                Node n = node.getChildNodes().item(nodeIndex);
                                // getNodeValue() can be null for non-text nodes, so guard before calling contains()
                                if (n != null && n.getNodeValue() != null && n.getNodeValue().contains("HYPERLINK")) {
                                    // this is the "strange" formatted hyperlink, do what you want
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
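Once a run containing a HYPERLINK instruction has been found, the target still has to be separated from the rest of the instruction text. Below is a minimal sketch, assuming the instruction looks like the one in the question (the keyword HYPERLINK followed by a double-quoted target). The class name FieldInstructionParser and the method hyperlinkTarget are hypothetical helpers for illustration, not part of Apache POI.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Apache POI class): pulls the target out of
// field instruction text such as:  HYPERLINK "https://stackoverflow.com"
public class FieldInstructionParser {

    // Optional leading whitespace, the keyword HYPERLINK, whitespace,
    // then a double-quoted target captured in group 1.
    private static final Pattern HYPERLINK_TARGET =
            Pattern.compile("^\\s*HYPERLINK\\s+\"([^\"]*)\"");

    // Returns the quoted target, or Optional.empty() if the text is null,
    // is not a HYPERLINK instruction, or has no quoted target right after the keyword.
    public static Optional<String> hyperlinkTarget(String instruction) {
        if (instruction == null) {
            return Optional.empty();
        }
        Matcher m = HYPERLINK_TARGET.matcher(instruction);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints Optional[https://stackoverflow.com]
        System.out.println(hyperlinkTarget(" HYPERLINK \"https://stackoverflow.com\" "));
        // prints Optional.empty
        System.out.println(hyperlinkTarget(" PAGEREF _Toc1 \\h "));
    }
}
```

In the traversal above, the string passed in would be n.getNodeValue(). Word may split a long instruction across several w:instrText runs, in which case the pieces would need to be concatenated before parsing; this sketch does not handle that case, nor targets written without quotes.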