
bidi string can't be read from Word (Apache POI)

I'm writing a bidi String to an MS Word file using Apache POI after wrapping it with the sequence aString = "\u202E" + aString + "\u202C"; The text renders correctly in the file, and reads fine when I retrieve the string again. But if I modify the file in any way, reading that string suddenly returns true with isBlank(). Thank you in advance for any suggestions/help!
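For reference, the wrapping step described above can be sketched in plain Java. This assumes the two control characters in the question are U+202E (RIGHT-TO-LEFT OVERRIDE) and U+202C (POP DIRECTIONAL FORMATTING); the class and method names are made up for illustration.

```java
public class BidiWrap {

    // Assumed control characters: U+202E opens a right-to-left override,
    // U+202C pops the most recent directional formatting character.
    static final char RLO = '\u202E';
    static final char PDF = '\u202C';

    // Hypothetical helper: surround the text with the override pair.
    static String wrap(String s) {
        return RLO + s + PDF;
    }

    // Hypothetical helper: remove the pair again if both ends carry it.
    static String unwrap(String s) {
        if (s.length() >= 2
                && s.charAt(0) == RLO
                && s.charAt(s.length() - 1) == PDF) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String wrapped = wrap("abc");
        // The wrapped string is two characters longer than the input.
        System.out.println(wrapped.length());
        System.out.println(unwrap(wrapped));
    }
}
```

The control characters are invisible when printed, so comparing lengths or code points is a more reliable check than looking at the output.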

When Microsoft Word stores bidirectional text in its Office Open XML *.docx format, it sometimes wraps the text runs in special w:bdo (bidirectional override) elements. Apache POI does not read those elements so far. So if an XWPFParagraph contains only such elements, paragraph.getText() will return an empty string.
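The effect can be illustrated without Apache POI at all. The following is only a sketch using the JDK's DOM parser on a hand-written paragraph fragment. It is not how POI is implemented; it just shows how a reader that looks only at runs directly under w:p misses a run nested in w:bdo. The class name, method names, and sample XML are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class BdoDemo {

    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample: one plain run, one run nested inside w:bdo.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W + "\">"
        + "<w:r><w:t>plain</w:t></w:r>"
        + "<w:bdo w:val=\"rtl\"><w:r><w:t>nested</w:t></w:r></w:bdo>"
        + "</w:p>";

    // Parse the fragment and return its root element (the paragraph).
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects text only from w:r elements that are direct children of w:p.
    static String directRunText(Element paragraph) {
        StringBuilder sb = new StringBuilder();
        for (Node n = paragraph.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W.equals(n.getNamespaceURI())
                    && "r".equals(n.getLocalName())) {
                sb.append(n.getTextContent());
            }
        }
        return sb.toString();
    }

    // Collects text from every w:t descendant, however deeply nested.
    static String allText(Element paragraph) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = paragraph.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Element p = parse(SAMPLE);
        System.out.println(directRunText(p)); // prints "plain"
        System.out.println(allText(p));       // prints "plainnested"
    }
}
```

If every run of the paragraph sat inside w:bdo, the first method would return an empty string, which matches the symptom in the question.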

One could use org.apache.xmlbeans.XmlCursor to really get all text from all XWPFParagraphs, like so:

import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.xmlbeans.XmlCursor;

public class ReadWordParagraphs {

    // Returns the concatenated text of everything below the paragraph's
    // XML element, including runs nested inside w:bdo.
    static String getAllTextFromParagraph(XWPFParagraph paragraph) {
        XmlCursor cursor = paragraph.getCTP().newCursor();
        return cursor.getTextValue();
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document when done
        try (XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocument.docx"))) {
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                System.out.println(paragraph.getText()); // will not return text in w:bdo elements
                System.out.println(getAllTextFromParagraph(paragraph)); // will return all text content of paragraph
            }
        }
    }
}


 