
How to extract text from a .doc document using Apache POI?

I tried the two code snippets below to extract text from a .doc file:

HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
Range range = document.getRange();
int len = range.numParagraphs();
StringBuilder builder = new StringBuilder();

for (int i = 0; i < len; i++) {
    // Append the raw text of each paragraph.
    builder.append(range.getParagraph(i).text());
}


HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
WordExtractor wordExtractor = new WordExtractor(document);
String[] paragraphs = wordExtractor.getParagraphText();
StringBuilder builder = new StringBuilder();
for (String p : paragraphs) {
    // Append the text of each extracted paragraph.
    builder.append(p);
}

However, both of them always output some strange characters, for example: ?PAGEREF_Toc351848910\\h10?HYPERLINK\\l _Toc351848911 ?CITATIONPla\\l1033[?HYPERLINK\\l"Pla"13] . I would like to know where these come from and how to remove them when extracting text from a .doc file.

Thanks in advance

I hope this gives you some insight. The method below reads the paragraphs with WordExtractor and writes them to a PDF.

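    // Imports assume Apache POI (HWPF) and iText 5; adjust the packages to the versions you use.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;

    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {

        try {
            Document pdfdoc = new Document();

            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

            // Create a WordExtractor to pull the text out of the HWPFDocument.
            WordExtractor we = new WordExtractor(doc);

            // Write to the path passed in as outputPdf (the original referenced an undefined variable, desc).
            OutputStream outputFile = new FileOutputStream(new File(outputPdf));

            // Create a PDF writer that writes pdfdoc to the output file.
            PdfWriter.getInstance(pdfdoc, outputFile);
            // An iText document has to be opened before content is added to it.
            pdfdoc.open();

            Paragraph para = new Paragraph();

            // Collect all paragraphs.
            String[] paragraphs = we.getParagraphText();

            for (int i = 0; i < paragraphs.length; i++) {
                // Add the paragraph to the document.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }
            // Print all paragraphs together.
            System.out.println(String.join("", paragraphs));
            // Add all paragraphs together to the pdfdoc document.
            pdfdoc.add(para);
            pdfdoc.close();

        } catch (Exception e) {
            // The handler body is not shown in the original; rethrowing is an assumption.
            throw e;
        }
    }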
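
As for where the strange characters come from: they look like Word field codes (PAGEREF, HYPERLINK, CITATION). In the binary .doc format a field is stored inline in the text as a begin mark (0x13), the field instructions, a separator mark (0x14), the visible result, and an end mark (0x15). Those control characters are most likely what gets printed as "?" in your output. If I remember the API correctly, WordExtractor has a static stripFields(String) helper meant for exactly this, which you could apply to each paragraph before appending it. The sketch below shows the same idea in plain Java with no POI dependency. The class and method names are made up for illustration, and it assumes the field marks are reasonably well formed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Apache POI.
class FieldCodeStripper {
    private static final char FIELD_BEGIN = '\u0013';     // starts a field
    private static final char FIELD_SEPARATOR = '\u0014'; // instructions end, result begins
    private static final char FIELD_END = '\u0015';       // ends a field

    // Drops field instructions and field marks, keeps the visible field results.
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while we are still inside its instructions.
        Deque<Boolean> open = new ArrayDeque<>();
        int hidden = 0; // number of open fields that are still in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from "instructions" to "result".
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator has no result, so it ends while still hidden.
                if (!open.isEmpty() && open.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary text, or a field result that is not nested inside instructions.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See page \u0013 PAGEREF _Toc351848910 \\h \u001410\u0015 and "
                + "\u0013 HYPERLINK \\l \"Pla\" \u0014[1]\u0015.";
        System.out.println(stripFieldCodes(raw)); // prints: See page 10 and [1].
    }
}
```

With something like this you would call stripFieldCodes(p) on each paragraph string before builder.append(...). Keeping the field result (the page number, the link text) and dropping only the instructions usually matches what Word shows on screen.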

