How do I extract text from a .doc document using Apache POI?

Question

I am using the following code snippets to extract text from a .doc file:

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    Range range = document.getRange();
    int len = range.numParagraphs();
    StringBuilder builder = new StringBuilder();

    for (int i = 0; i < len; i++) {
        builder.append(range.getParagraph(i).text());
    }

and

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    WordExtractor wordExtractor = new WordExtractor(document);
    String[] paragraphs = wordExtractor.getParagraphText();
    StringBuilder builder = new StringBuilder();
    for (String p : paragraphs) {
        builder.append(p);
    }

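However, both of them always output some strange characters, for example: ?PAGEREF_Toc351848910\\h10?HYPERLINK\\l _Toc351848911 ?CITATIONPla\\l1033[?HYPERLINK\\l"Pla"13] . So I would like to know where they come from and how to remove them when extracting text from a .doc file.

Thanks in advance.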

Answer 1

I hope this gives you some insight.

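    // Imports below assume iText 5 (com.itextpdf.text) and Apache POI's HWPF module;
    // the original answer did not say which PDF library it used.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;

    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {

        try {
            Document pdfdoc = new Document();

            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

            // Wrap the HWPFDocument in a WordExtractor to read its text.
            WordExtractor we = new WordExtractor(doc);

            // Fixed: the original referenced an undefined variable `desc` here.
            OutputStream outputFile = new FileOutputStream(new File(outputPdf));

            // Create a PDF writer that writes the document to the output file.
            PdfWriter.getInstance(pdfdoc, outputFile);

            pdfdoc.open();

            Paragraph para = new Paragraph();

            // Collect all paragraphs from the Word document.
            String[] paragraphs = we.getParagraphText();

            for (int i = 0; i < paragraphs.length; i++) {
                // Append each paragraph's text to the single PDF paragraph.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }
            // Print all paragraphs together.
            System.out.println(para);
            // Add the combined paragraph to the PDF document.
            pdfdoc.add(para);

            pdfdoc.close();
            we.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }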
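
On the question of where the stray tokens come from: PAGEREF, HYPERLINK and CITATION look like Word field instructions (table-of-contents entries, links, citations). In the binary .doc format a field is normally delimited by three control characters, 0x13 (field begin), 0x14 (separator between the instruction and the displayed result) and 0x15 (field end), and those are probably what shows up as `?` in the output. The sketch below is a minimal, POI-independent way to drop the instruction part and keep the displayed result, assuming the extracted string still contains those markers. The class name `FieldStripper` is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: removes Word field instructions from extracted .doc text. */
public class FieldStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    public static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside its instruction part,
        // FALSE once the separator has been seen and the visible result follows.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                    inInstruction.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                }
            } else if (!inInstruction.contains(Boolean.TRUE)) {
                // Keep ordinary text and field results; skip instruction text.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

It could be applied to each paragraph before appending, e.g. `builder.append(FieldStripper.stripFields(p))`. POI's `WordExtractor` also has a static `stripFields(String)` method that is intended for the same purpose and may be worth trying first; how well it handles nested fields such as the CITATION example may depend on the POI version.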
