How to extract text from a .doc document using Apache POI?
I am using the following code snippets to extract text from a .doc file:
    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    Range range = document.getRange();
    int len = range.numParagraphs();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < len; i++) {
        builder.append(range.getParagraph(i).text());
    }
and
    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    WordExtractor wordExtractor = new WordExtractor(document);
    String[] paragraphs = wordExtractor.getParagraphText();
    StringBuilder builder = new StringBuilder();
    for (String p : paragraphs) {
        builder.append(p);
    }
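However, both of them always output some strange characters. For example:
?PAGEREF_Toc351848910\\h10?HYPERLINK\\l
_Toc351848911
?CITATIONPla\\l1033[?HYPERLINK\\l"Pla"13]
So I would like to know where these come from and how to remove them when extracting text from a .doc file.
Thanks in advance.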
I hope this gives you some insight.
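    // Imports assume iText 5 (com.itextpdf.text); adjust the packages for other iText versions.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;

    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {
        try {
            Document pdfdoc = new Document();
            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));
            // Wrap the HWPFDocument in a WordExtractor to read its text.
            WordExtractor we = new WordExtractor(doc);
            // Open the target PDF file given by the outputPdf parameter.
            OutputStream outputFile = new FileOutputStream(outputPdf);
            // Attach a PdfWriter so that content added to pdfdoc is written to that file.
            PdfWriter.getInstance(pdfdoc, outputFile);
            pdfdoc.open();
            Paragraph para = new Paragraph();
            // Collect the text of every paragraph in the Word document.
            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                // Append this paragraph's text to the single iText Paragraph.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }
            // Print the collected text for inspection.
            System.out.println(para);
            // Add the collected text to the PDF document, then release resources.
            pdfdoc.add(para);
            pdfdoc.close();
            we.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }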
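On where the strange characters come from: `PAGEREF`, `HYPERLINK` and `CITATION` are Word field instructions. In the binary .doc format a field is stored inline in the text, delimited by three control characters: U+0013 starts the field code, U+0014 separates the code from the visible result, and U+0015 ends the field. The `?` marks in the output are most likely those control characters being printed. Below is a minimal, self-contained sketch of removing them while keeping the visible result. The class name `FieldStripper` is hypothetical and not part of POI, and the nesting handling is my own assumption about how fields may nest.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of Apache POI) that removes Word field codes from extracted text. */
final class FieldStripper {
    // Control characters that delimit a field in text extracted from a binary .doc file.
    private static final char FIELD_BEGIN = '\u0013';     // starts the field code
    private static final char FIELD_SEPARATOR = '\u0014'; // ends the code, starts the visible result
    private static final char FIELD_END = '\u0015';       // closes the field

    private FieldStripper() {
    }

    /** Returns the text with field codes and field marker characters removed; field results are kept. */
    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still inside that field's code part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case FIELD_BEGIN:
                    openFields.push(Boolean.TRUE);
                    break;
                case FIELD_SEPARATOR:
                    // The innermost field switches from its code part to its result part.
                    if (!openFields.isEmpty()) {
                        openFields.pop();
                        openFields.push(Boolean.FALSE);
                    }
                    break;
                case FIELD_END:
                    if (!openFields.isEmpty()) {
                        openFields.pop();
                    }
                    break;
                default:
                    // Keep the character unless some enclosing field is still in its code part.
                    if (!openFields.contains(Boolean.TRUE)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

You would call `FieldStripper.stripFields(p)` on each paragraph before appending it to the `StringBuilder`. Apache POI also has a static `WordExtractor.stripFields(String)` helper intended for the same job; check the Javadoc of your POI version to confirm how it treats field results before relying on it.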